
5 September 2021

Reproducible Builds: Reproducible Builds in August 2021

Welcome to the latest report from the Reproducible Builds project. In this post, we round up the important things that happened in the world of reproducible builds in August 2021. As always, if you are interested in contributing to the project, please visit the Contribute page on our website.
There were a large number of talks related to reproducible builds at DebConf21 this year, the 21st annual conference of the Debian Linux distribution (full schedule):
PackagingCon (@PackagingCon) is a new conference for developers of package management software as well as their related communities and stakeholders. The virtual event, which is scheduled to take place on the 9th and 10th November 2021, has a mission to bring different ecosystems together: "from Python's pip to Rust's cargo to Julia's Pkg, from Debian apt over Nix to conda and mamba, and from vcpkg to Spack we hope to have many different approaches to package management at the conference". A number of people from the reproducible builds community are planning on attending this new conference, and some may even present. Tickets start at $20 USD.
As reported in our May report, the president of the United States signed an executive order outlining policies aimed at improving cybersecurity in the US. The executive order comes after a number of highly-publicised security problems such as a ransomware attack that affected an oil pipeline between Texas and New York and the SolarWinds hack that affected a large number of US federal agencies. As a followup this month, a detailed fact sheet was released announcing a number of large-scale initiatives that will undoubtedly be related to software supply chain security and, as a result, reproducible builds.
Lastly, we held another productive meeting on IRC in August (original announcement), which ran for just short of two hours. A full set of notes from the meeting is available.

Software development kpcyrd announced an interesting new project this month called "I probably didn't backdoor this", which is an attempt to be:
a practical attempt at shipping a program and having reasonably solid evidence there's probably no backdoor. All source code is annotated and there are instructions explaining how to use reproducible builds to rebuild the artifacts distributed in this repository from source. The idea is shifting the burden of proof from "you need to prove there's a backdoor" to "we need to prove there's probably no backdoor". This repository is less about code (we're going to try to keep code at a minimum actually) and instead contains technical writing that explains why these controls are effective and how to verify them. You are very welcome to adopt the techniques used here in your projects.
As the project's README goes on to mention: "the techniques used to rebuild the binary artifacts are only possible because the builds for this project are reproducible". This was also announced on our mailing list this month in a thread titled i-probably-didnt-backdoor-this: Reproducible Builds for upstreams. kpcyrd also wrote a detailed blog post about the problems surrounding Linux distributions (such as Alpine and Arch Linux) that distribute compiled Python bytecode in the form of .pyc files generated during the build process.
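
The core idea is that anyone can rebuild the artifact from the published source and check that they get a bit-for-bit identical result. As a minimal illustration (not part of the project itself; the file paths below are hypothetical), comparing the checksum of a distributed artifact against a local rebuild might look like this:

import hashlib

def sha256(path):
    # Stream the file in 1 MiB chunks so large artifacts need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the artifact as distributed, and the one you rebuilt yourself
# from the same source and toolchain.
official = sha256("release/program-v1.0.bin")
rebuilt = sha256("my-rebuild/program-v1.0.bin")
print("official:", official)
print("rebuilt: ", rebuilt)
print("MATCH" if official == rebuilt else "MISMATCH")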

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb made a number of changes, including releasing version 180, version 181 and version 182, as well as the following changes:
  • New features:
    • Add support for extracting the signing block from Android APKs. [ ]
    • If we specify a suffix for a temporary file or directory within the code, ensure it starts with an underscore (i.e. "_") to make the generated filenames more human-readable. [ ]
    • Don't include short GCC lines that differ on a single prefix byte either. These are distracting, not very useful and are simply the strings(1) command's idea of the build ID, which is displayed elsewhere in the diff. [ ][ ]
    • Don t include specific .debug-like lines in the ELF-related output, as it is invariably a duplicate of the debug ID that exists better in the readelf(1) differences for this file. [ ]
  • Bug fixes:
    • Add a special case to SquashFS image extraction to not fail if we aren't the superuser. [ ]
    • Only use java -jar /path/to/apksigner.jar if we have an apksigner.jar, as newer versions of apksigner in Debian use a shell wrapper script which will be rejected if passed directly to the JVM. [ ]
    • Reduce the maximum line length for calculating Wagner-Fischer, improving the speed of output generation a lot (see the sketch after this list). [ ]
    • Don t require apksigner in order to compare .apk files using apktool. [ ]
    • Update calls (and tests) for the new version of odt2txt. [ ]
  • Output improvements:
    • Mention in the output if the apksigner tool is missing. [ ]
    • Profile diffoscope.diff.linediff and specialize. [ ][ ]
  • Logging improvements:
    • Format debug-level messages related to ELF sections using the diffoscope.utils.format_class. [ ]
    • Print the size of generated reports in the logs (if possible). [ ]
    • Include profiling information in --debug output if --profile is not set. [ ]
  • Codebase improvements:
    • Clarify a comment about the HUGE_TOOLS Python dictionary. [ ]
    • We can pass -f to apktool to avoid creating a strangely-named subdirectory. [ ]
    • Drop an unused File import. [ ]
    • Update the supported & minimum version of Black. [ ]
    • We don't use the logging variable in a specific place, so alias it to an underscore (i.e. "_") instead. [ ]
    • Update various copyright years. [ ]
    • Clarify a comment. [ ]
  • Test improvements:
    • Update a test to check specific contents of SquashFS listings, otherwise it fails depending on the test system's user ID to username passwd(5) mapping. [ ]
    • Assign seen and expected values to local variables to improve contextual information in failed tests. [ ]
    • Don't print an orphan newline when the source code formatting test passes. [ ]
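
To illustrate why the Wagner-Fischer change above helps: the algorithm fills a table whose size is the product of the two line lengths, so capping the length that is compared bounds both time and memory. A rough sketch of the idea follows (illustrative only; this is not diffoscope's actual code, and the 1024-character cap is an arbitrary example):

def capped_edit_distance(a, b, max_len=1024):
    # Truncate very long lines first: the Wagner-Fischer table below costs
    # O(len(a) * len(b)) time and memory, so the cap bounds the work done.
    a, b = a[:max_len], b[:max_len]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]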

In addition, Santiago Torres Arias added support for Squashfs version 4.5 [ ] and Felix C. Stegerman suggested a number of small improvements to the output of the new APK signing block [ ]. Lastly, Chris Lamb uploaded python-libarchive-c version 3.1-1 to Debian experimental for the new 3.x branch; python-libarchive-c is used by diffoscope.

Distribution work In Debian, 68 reviews of packages were added, 33 were updated and 10 were removed this month, adding to our knowledge about identified issues. Two new issue types have been identified too: nondeterministic_ordering_in_todo_items_collected_by_doxygen and kodi_package_captures_build_path_in_source_filename_hash. kpcyrd published another monthly report on their work on reproducible builds within the Alpine and Arch Linux distributions, specifically mentioning rebuilderd, one of the components powering reproducible.archlinux.org. The report also touches on binary transparency, an important component for supply chain security. The @GuixHPC account on Twitter posted an infographic on what fraction of GNU Guix packages are bit-for-bit reproducible. Finally, Bernhard M. Wiedemann posted his monthly reproducible builds status report for openSUSE.

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including: Elsewhere, it was discovered that when supporting various new language features and APIs for Android apps, the resulting APK files that are generated now vary wildly from build to build (example diffoscope output). Happily, it appears that a patch has been committed to the relevant source tree. This was also discussed on our mailing list this month in a thread titled Android desugaring and reproducible builds started by Marcus Hoffmann.

Website and documentation There were quite a few changes to the Reproducible Builds website and documentation this month, including:
  • Felix C. Stegerman:
    • Update the website self-build process to not use the buster-backports suite now that Debian Bullseye is the stable release. [ ]
  • Holger Levsen:
    • Add a new page documenting various package rebuilder solutions. [ ]
    • Add some historical talks and slides from DebConf20. [ ][ ]
    • Various improvements to the history page. [ ][ ][ ]
    • Rename the "Comparison protocol" documentation category to "Verification". [ ]
    • Update links to F-Droid documentation. [ ]
  • Ian Muchina:
    • Increase the font size of titles and de-emphasize event details on the talk page. [ ]
    • Rename the README file to README.md to improve the user experience when browsing the Git repository in a web browser. [ ]
  • Mattia Rizzolo:
    • Drop a position:fixed CSS statement that was interacting badly with some width settings. [ ]
    • Fix the sizing of the elements inside the side navigation bar. [ ]
    • Show gold level sponsors and above in the sidebar. [ ]
    • Update the documentation within reprotest to mention how ldconfig conflicts with the kernel variation. [ ]
  • Roland Clobus:
    • Add a ticket number for the issue with the live Cinnamon image and diffoscope. [ ]

Testing framework The Reproducible Builds project runs a testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Holger Levsen:
    • Debian-related changes:
      • Make a large number of changes to support the new Debian bookworm release, including adding it to the dashboard [ ], starting to schedule tests [ ] and adding suitable Apache redirects [ ], etc. [ ][ ][ ][ ][ ]
      • Make the first build use LANG=C.UTF-8 to match the official Debian build servers. [ ]
      • Only test Debian Live images once a week. [ ]
      • Upgrade all nodes to use Debian Bullseye [ ] [ ]
      • Update README documentation for the Debian Bullseye release. [ ]
    • Other changes:
      • Only include rsync output if the $DEBUG variable is enabled. [ ]
      • Don't try to install mock, a tool used some time ago to build Fedora packages. [ ]
      • Drop an unused function. [ ]
      • Various documentation improvements. [ ][ ]
      • Improve the node health check to detect zombie jobs. [ ]
  • Jessica Clarke (FreeBSD-related changes):
    • Update the location and branch name for the main FreeBSD Git repository. [ ]
    • Correctly ignore the source tarball when comparing build results. [ ]
    • Drop an outdated version number from the documentation. [ ]
  • Mattia Rizzolo:
    • Block F-Droid jobs from running whilst the setup is running. [ ]
    • Enable debugging for the rsync job related to Debian Live images. [ ]
    • Pass BUILD_TAG and BUILD_URL environment for the Debian Live jobs. [ ]
    • Refactor the master_wrapper script to use a Bash array for the parameters. [ ]
    • Prefer YAML's safe_load() function over the unsafe variant (see the sketch after this list). [ ]
    • Use the correct variable in the Apache config to match possible existing files on disk. [ ]
    • Stop issuing HTTP 301 redirects for things that are not actually permanent. [ ]
  • Roland Clobus (Debian live image generation):
    • Increase the diffoscope timeout from 120 to 240 minutes; the Cinnamon image should now be able to finish. [ ]
    • Use the new snapshot service. [ ]
    • Make a number of improvements to artifact handling, such as moving the artifacts to the Jenkins host [ ] and correctly cleaning them up at the right time. [ ][ ][ ]
    • Where possible, link to the Jenkins build URL that created the artifacts. [ ][ ]
    • Only allow one job to run at a time. [ ]
  • Vagrant Cascadian:
    • Temporarily disable armhf nodes for DebConf21. [ ][ ]
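
As a side note on the safe_load() item above, the difference between the two PyYAML entry points is roughly the following (a generic illustration with made-up example data, not the actual testing-framework code):

import yaml

document = "timeout: 240\nsuites: [bullseye, bookworm]"

# yaml.safe_load() only builds plain YAML types (dicts, lists, strings, numbers),
# so untrusted input cannot instantiate arbitrary Python objects.
config = yaml.safe_load(document)
print(config["timeout"], config["suites"])

# By contrast, yaml.load() with the default (unsafe) loader can construct
# arbitrary objects via !!python/object tags, which is why it is best avoided
# for anything that might come from an untrusted source.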

Lastly, if you are interested in contributing to the Reproducible Builds project, please visit the Contribute page on our website. You can get in touch with us via:

15 August 2021

Dirk Eddelbuettel: RcppBDT 0.2.4 on CRAN: Updates

After a seven-year break (!!), the RcppBDT package has been updated on CRAN. The RcppBDT package is an early adopter of Rcpp and was one of the first packages utilizing Boost and its Date_Time library. The now more widely-used package anytime is a direct descendant of RcppBDT. In fact, the last time RcppBDT was released, anytime did not yet exist. And some of the changes now finally released here in this version are some of the first steps made towards what became anytime. RcppBDT is broader in scope and provides a wider range of functionality, but in a somewhat rougher form, as we never sat down to write higher-end wrappers in R for all the potential use cases. When we wrote the first RcppBDT versions, many other popular date/time packages were all in R code and not compiled, and this package showed how things could be done at the compiled level. Now other packages, including anytime, have filled the void, so fully polishing RcppBDT may never happen. In any event, this release refreshes the package and brings it to full R CMD check --as-cran compliance. The NEWS entry follows:

Changes in version 0.2.4 (2021-08-15)
  • New utility function toPOSIXct which can take multiple input format (integer, floating point or character) vectors and can convert by relying on a wide variety of standard formats. This predates what has long been split off into a new package anytime which is more functional and featureful.
  • New demo 'toPOSIXct' illustrating the feature.
  • New demo 'toPOSIXctTiming' benchmarking it.
  • Documentation for new functions was added as well.
  • CI now uses run.sh from r-ci.
  • Functions getLastDayOfWeekInMonth and getFirstDayOfWeekInMonth now use the dow argument.
  • The shared library is now registered when loaded from NAMESPACE.
  • C level entry points are now registered as R now recommends.
  • Several badges were added to the README.md file.
  • Several fields were added to the DESCRIPTION file, and/or updated.
  • Documentation URLs were both updated as needed and converted to https.

Courtesy of my CRANberries, there is also a diffstat report for this release. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

2 August 2021

Colin Watson: Launchpad now runs on Python 3!

After a very long porting journey, Launchpad is finally running on Python 3 across all of our systems. I wanted to take a bit of time to reflect on why my emotional responses to this port differ so much from those of some others who've done large ports, such as the Mercurial maintainers. It's hard to deny that we've had to burn a lot of time on this, which I'm sure has had an opportunity cost, and from one point of view it's essentially running to stand still: there is no single compelling feature that we get solely by porting to Python 3, although it's clearly a prerequisite for tidying up old compatibility code and being able to use modern language facilities in the future. And yet, on the whole, I found this a rewarding project and enjoyed doing it. Some of this may be because by inclination I'm a maintenance programmer and actually enjoy this sort of thing. My default view tends to be that software version upgrades may be a pain but it's much better to get that pain over with as soon as you can rather than trying to hold back the tide; you can certainly get involved and try to shape where things end up, but rightly or wrongly I can't think of many cases when a righteously indignant user base managed to arrange for the old version to be maintained in perpetuity so that they never had to deal with the new thing (OK, maybe Perl 5 counts here). I think a more compelling difference between Launchpad and Mercurial, though, may be that very few other people really had a vested interest in what Python version Launchpad happened to be running, because it's all server-side code (aside from some client libraries such as launchpadlib, which were ported years ago). As such, we weren't trying to do this with the internet having Strong Opinions at us. We were doing this because it was obviously the only long-term-maintainable path forward, and in more recent times because some of our library dependencies were starting to drop support for Python 2 and so it was obviously going to become a practical problem for us sooner or later; but if we'd just stayed on Python 2 forever then fundamentally hardly anyone else would really have cared directly, only maybe about some indirect consequences of that. I don't follow Mercurial development so I may be entirely off-base, but if other people were yelling at me about how late my project was to finish its port, that in itself would make me feel more negatively about the project even if I thought it was a good idea. Having most of the pressure come from ourselves rather than from outside meant that wasn't an issue for us. I'm somewhat inclined to think of the process as an extreme version of paying down technical debt. Moving from Python 2.7 to 3.5, as we just did, means skipping over multiple language versions in one go, and if similar changes had been made more gradually it would probably have felt a lot more like the typical dependency update treadmill. I appreciate why not everyone might want to think of it this way: maybe this is just my own rationalization.

Reflections on porting to Python 3 I'm not going to defend the Python 3 migration process; it was pretty rough in a lot of ways. Nor am I going to spend much effort relitigating it here, as it's already been done to death elsewhere, and as I understand it the core Python developers have got the message loud and clear by now.
At a bare minimum, a lot of valuable time was lost early in Python 3's lifetime hanging on to flag-day-type porting strategies that were impractical for large projects, when it should have been providing for bilingual strategies (code that runs in both Python 2 and 3 for a transitional period), which is where most libraries and most large migrations ended up in practice. For instance, the early advice to library maintainers to maintain two parallel versions or perhaps translate dynamically with 2to3 was entirely impractical in most non-trivial cases and wasn't what most people ended up doing, and yet the idea that "2to3 is all you need" still floats around Stack Overflow and the like as a result. (These days, I would probably point people towards something more like Eevee's porting FAQ as somewhere to start.) There are various fairly straightforward things that people often suggest could have been done to smooth the path, and I largely agree: not removing the u'' string prefix only to put it back in 3.3, fewer gratuitous compatibility breaks in the name of tidiness, and so on. But if I had a time machine, the number one thing I would ask to have been done differently would be introducing type annotations in Python 2 before Python 3 branched off. It's true that it's technically possible to do type annotations in Python 2, but the fact that it's a different syntax that would have to be fixed later is offputting, and in practice it wasn't widely used in Python 2 code. To make a significant difference to the ease of porting, annotations would need to have been introduced early enough that lots of Python 2 library code used them, so that porting code didn't have to be quite so much of an exercise of manually figuring out the exact nature of string types from context. Launchpad is a complex piece of software that interacts with multiple domains: for example, it deals with a database, HTTP, web page rendering, Debian-format archive publishing, and multiple revision control systems, and there's often overlap between domains. Each of these tends to imply different kinds of string handling. Web page rendering is normally done mainly in Unicode, converting to bytes as late as possible; revision control systems normally want to spend most of their time working with bytes, although the exact details vary; HTTP is of course bytes on the wire, but Python's WSGI interface has some string type subtleties. In practice I found myself thinking about at least four string-like types (that is, things that in a language with a stricter type system I might well want to define as distinct types and restrict conversion between them): bytes, text, ordinary native strings (str in either language, encoded to UTF-8 in Python 2), and native strings with WSGI's encoding rules. Some of these are emergent properties of writing in the intersection of Python 2 and 3, which is effectively a specialized language of its own, without coherent official documentation, whose users must intuit its behaviour by comparing multiple sources of information or by referring to unofficial porting guides: not a very satisfactory situation. Fortunately much of the complexity collapses once it becomes possible to write solely in Python 3. Some of the difficulties we ran into are not ones that are typically thought of as Python 2-to-3 porting issues, because they were changed later in Python 3's development process.
For instance, the email module was substantially improved in around the 3.2/3.3 timeframe to handle Python 3's bytes/text model more correctly, and since Launchpad sends quite a few different kinds of email messages and has some quite picky tests for exactly what it emits, this entailed a lot of work in our email sending code and in our test suite to account for that. (It took me a while to work out whether we should be treating raw email messages as bytes or as text; bytes turned out to work best.) 3.4 made some tweaks to the implementation of quoted-printable encoding that broke a number of our tests in ways that took some effort to fix, because the tests needed to work on both 2.7 and 3.5. The list goes on. I got quite proficient at digging through Python's git history to figure out when and why some particular bit of behaviour had changed. One of the thorniest problems was parsing HTTP form data. We mainly rely on zope.publisher for this, which in turn relied on cgi.FieldStorage; but cgi.FieldStorage is badly broken in some situations on Python 3. Even if that bug were fixed in a more recent version of Python, we can't easily use anything newer than 3.5 for the first stage of our port due to the version of the base OS we're currently running, so it wouldn't help much. In the end I fixed some minor issues in the multipart module (and was kindly given co-maintenance of it) and converted zope.publisher to use it. Although this took a while to sort out, it seems to have gone very well. A couple of other interesting late-arriving issues were around pickle. For most things we normally prefer safer formats such as JSON, but there are a few cases where we use pickle, particularly for our session databases. One of my colleagues pointed out that I needed to remember to tell pickle to stick to protocol 2, so that we'd be able to switch back and forward between Python 2 and 3 for a while; quite right, and we later ran into a similar problem with marshal too. A more surprising problem was that datetime.datetime objects pickled on Python 2 require special care when unpickling on Python 3; rather than the approach that ended up being implemented and documented for Python 3.6, though, I preferred a custom unpickler, both so that things would work on Python 3.5 and so that I wouldn't have to risk affecting the decoding of other pickled strings in the session database.

General lessons Writing this over a year after Python 2's end-of-life date, and certainly nowhere near the leading edge of Python 3 porting work, it's perhaps more useful to look at this in terms of the lessons it has for other large technical debt projects. I mentioned in my previous article that I used the approach of an enormous and frequently-rebased git branch as a working area for the port, committing often and sometimes combining and extracting commits for review once they seemed to be ready. A port of this scale would have been entirely intractable without a tool of similar power to git rebase, so I'm very glad that we finished migrating to git in 2019. I relied on this right up to the end of the port, and it also allowed for quick assessments of how much more there was to land. git worktree was also helpful, in that I could easily maintain working trees built for each of Python 2 and 3 for comparison. As is usual for most multi-developer projects, all changes to Launchpad need to go through code review, although we sometimes make exceptions for very simple and obvious changes that can be self-reviewed.
Since I knew from the outset that this was going to generate a lot of changes for review, I structured my work from the start to try to make it as easy as possible for my colleagues to review it. This generally involved keeping most changes to a somewhat manageable size of 800 lines or less (although this wasn't always possible), and arranging commits mainly according to the kind of change they made rather than their location. For example, when I needed to fix issues with / in Python 3 being true division rather than floor division, I did so in one commit across the various places where it mattered and took care not to mix it with other unrelated changes. This is good practice for nearly any kind of development, but it was especially important here since it allowed reviewers to consider a clear explanation of what I was doing in the commit message and then skim-read the rest of it much more quickly. It was vital to keep the codebase in a working state at all times, and deploy to production reasonably often: this way if something went wrong the amount of code we had to debug to figure out what had happened was always tractable. (Although I can't seem to find it now to link to it, I saw an account a while back of a company that had taken a flag-day approach instead with a large codebase. It seemed to work for them, but I'm certain we couldn't have made it work for Launchpad.) I can't speak too highly of Launchpad's test suite, much of which originated before my time. Without a great deal of extensive coverage of all sorts of interesting edge cases at both the unit and functional level, and a corresponding culture of maintaining that test suite well when making new changes, it would have been impossible to be anything like as confident of the port as we were. As part of the porting work, we split out a couple of substantial chunks of the Launchpad codebase that could easily be decoupled from the core: its Mailman integration and its code import worker. Both of these had substantial dependencies with complex requirements for porting to Python 3, and arranging to be able to do these separately on their own schedule was absolutely worth it. Like disentangling balls of wool, any opportunity you can take to make things less tightly-coupled is probably going to make it easier to disentangle the rest. (I can see a tractable way forward to porting the code import worker, so we may well get that done soon. Our Mailman integration will need to be rewritten, though, since it currently depends on the Python-2-only Mailman 2, and Mailman 3 has a different architecture.)

Python lessons Our database layer was already in pretty good shape for a port, since at least the modern bits of its table modelling interface were already strict about using Unicode for text columns. If you have any kind of pervasive low-level framework like this, then making it be pedantic at you in advance of a Python 3 port will probably incur much less swearing in the long run, as you won't be trying to deal with quite so many bytes/text issues at the same time as everything else. Early in our port, we established a standard set of __future__ imports and started incrementally converting files over to them, mainly because we weren't yet sure what else to do and it seemed likely to be helpful. absolute_import was definitely reasonable (and not often a problem in our code), and print_function was annoying but necessary. In hindsight I'm not sure about unicode_literals, though.
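
For reference, a bilingual module header of the sort described here might have looked roughly like the following (a hypothetical sketch; the exact set of imports and helpers Launchpad standardized on is not reproduced from the source):

# Hypothetical Python 2/3 "bilingual" module header.
from __future__ import absolute_import, print_function, unicode_literals

import six

# Under unicode_literals this literal is text on both Python 2 and 3; when a
# native str is genuinely required (for example in WSGI environ values),
# something like six.ensure_str() can be used to get the right type on each.
HEADER_NAME = six.ensure_str("X-Example-Header")
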
For files that only deal with bytes and text, unicode_literals was reasonable enough, but as I mentioned above there were also a number of cases where we needed literals of the language's native str type, i.e. bytes in Python 2 and text in Python 3: this was particularly noticeable in WSGI contexts, but also cropped up in some other surprising places. We generally either omitted unicode_literals or used six.ensure_str in such cases, but it was definitely a bit awkward and maybe I should have listened more to people telling me it might be a bad idea. A lot of Launchpad's early tests used doctest, mainly in the style where you have text files that interleave narrative commentary with examples. The development team later reached consensus that this was best avoided in most cases, but by then there were far too many doctests to conveniently rewrite in some other form. Porting doctests to Python 3 is really annoying. You run into all the little changes in how objects are represented as text (particularly u'...' versus '...', but plenty of other cases as well); you have next to no tools to do anything useful like skipping individual bits of a doctest that don't apply; using __future__ imports requires the rather obscure approach of adding the relevant names to the doctest's globals in the relevant DocFileSuite or DocTestSuite; dealing with many exception tracebacks requires something like zope.testing.renormalizing; and whatever code refactoring tools you're using probably don't work properly. Basically, don't have done that. It did all turn out to be tractable for us in the end, and I managed to avoid using much in the way of fragile doctest extensions aside from the aforementioned zope.testing.renormalizing, but it was not an enjoyable experience.

Regressions I know of nine regressions that reached Launchpad's production systems as a result of this porting work; of course there were various other regressions caught by CI or in manual testing. (Considering the size of this project, I count it as a resounding success that there were only nine production issues, and that for the most part we were able to fix them quickly.)

Equality testing of removed database objects One of the things we had to do while porting to Python 3 was to implement the __eq__, __ne__, and __hash__ special methods for all our database objects. This was quite conceptually fiddly, because doing this requires knowing each object's primary key, and that may not yet be available if we've created an object in Python but not yet flushed the actual INSERT statement to the database (most of our primary keys are auto-incrementing sequences). We thus had to take care to flush pending SQL statements in such cases in order to ensure that we know the primary keys. However, it's possible to have a problem at the other end of the object lifecycle: that is, a Python object might still be reachable in memory even though the underlying row has been DELETEd from the database. In most cases we don't keep removed objects around for obvious reasons, but it can happen in caching code, and buildd-manager crashed as a result (in fact while it was still running on Python 2). We had to take extra care to avoid this problem.

Debian imports crashed on non-UTF-8 filenames Python 2 has some unfortunate behaviour around passing bytes or Unicode strings (depending on the platform) to shutil.rmtree, and the combination of some porting work and a particular source package in Debian that contained a non-UTF-8 file name caused us to run into this.
The fix was to ensure that the argument passed to shutil.rmtree is a str regardless of Python version. We'd actually run into something similar before: it's a subtle porting gotcha, since it's quite easy to end up passing Unicode strings to shutil.rmtree if you're in the process of porting your code to Python 3, and you might easily not notice if the file names in your tests are all encoded using UTF-8.

lazr.restful ETags We eventually got far enough along that we could switch one of our four appserver machines (we have quite a number of other machines too, but the appservers handle web and API requests) to Python 3 and see what happened. By this point our extensive test suite had shaken out the vast majority of the things that could go wrong, but there was always going to be room for some interesting edge cases. A member of the Ubuntu kernel team reported that they were seeing an increase in 412 Precondition Failed errors in some of their scripts that use our webservice API. These can happen when you're trying to modify an existing resource: the underlying protocol involves sending an If-Match header with the ETag that the client thinks the resource has, and if this doesn't match the ETag that the server calculates for the resource then the client has to refresh its copy of the resource and try again. We initially thought that this might be legitimate since it can happen in normal operation if you collide with another client making changes to the same resource, but it soon became clear that something stranger was going on: we were getting inconsistent ETags for the same object even when it was unchanged. Since we'd recently switched a quarter of our appservers to Python 3, that was a natural suspect. Our lazr.restful package provides the framework for our webservice API, and roughly speaking it generates ETags by serializing objects into some kind of canonical form and hashing the result. Unfortunately the serialization was dependent on the Python version in a few ways, and in particular it serialized lists of strings such as lists of bug tags differently: Python 2 used [u'foo', u'bar', u'baz'] where Python 3 used ['foo', 'bar', 'baz']. In lazr.restful 1.0.3 we switched to using JSON for this, removing the Python version dependency and ensuring consistent behaviour between appservers.

Memory leaks This problem took the longest to solve. We noticed fairly quickly from our graphs that the appserver machine we'd switched to Python 3 had a serious memory leak. Our appservers had always been a bit leaky, but now it wasn't so much "a small hole that we can bail occasionally" as "the boat is sinking rapidly":

A serious memory leak

(Yes, this got in the way of working out what was going on with ETags for a while.) I spent ages messing around with various attempts to fix this. Since only a quarter of our appservers were affected, and we could get by on 75% capacity for a while, it wasn't urgent but it was definitely annoying. After spending some quality time with objgraph, for some time I thought traceback reference cycles might be at fault, and I sent a number of fixes to various upstream projects for those (e.g. zope.pagetemplate). Those didn't help the leaks much though, and after a while it became clear to me that this couldn't be the sole problem: Python has a cyclic garbage collector that will eventually collect reference cycles as long as there are no strong references to any objects in them, although it might not happen very quickly. Something else must be going on.
Debugging reference leaks in any non-trivial and long-running Python program is extremely arduous, especially with ORMs that naturally tend to end up with lots of cycles and caches. After a while I formed a hypothesis that zope.server might be keeping a strong reference to something, although I never managed to nail it down more firmly than that. This was an attractive theory as we were already in the process of migrating to Gunicorn for other reasons anyway, and Gunicorn also has a convenient max_requests setting that's good at mitigating memory leaks. Getting this all in place took some time, but once we did we found that everything was much more stable:

A rather flat memory graph

This isn't completely satisfying as we never quite got to the bottom of the leak itself, and it's entirely possible that we've only papered over it using max_requests: I expect we'll gradually back off on how frequently we restart workers over time to try to track this down. However, pragmatically, it's no longer an operational concern.

Mirror prober HTTPS proxy handling After we switched our script servers to Python 3, we had several reports of mirror probing failures. (Launchpad keeps lists of Ubuntu archive and image mirrors, and probes them every so often to check that they're reasonably complete and up to date.) This only affected HTTPS mirrors when probed via a proxy server, support for which is a relatively recent feature in Launchpad and involved some code that we never managed to unit-test properly: of course this is exactly the code that went wrong. Sadly I wasn't able to sort out that gap, but at least the fix was simple.

Non-MIME-encoded email headers As I mentioned above, there were substantial changes in the email package between Python 2 and 3, and indeed between minor versions of Python 3. Our test coverage here is pretty good, but it's an area where it's very easy to have gaps. We noticed that a script that processes incoming email was crashing on messages with headers that were non-ASCII but not MIME-encoded (and indeed then crashing again when it tried to send a notification of the crash!). The only examples of these I looked at were spam, but we still didn't want to crash on them. The fix involved being somewhat more careful about both the handling of headers returned by Python's email parser and the building of outgoing email notifications. This seems to be working well so far, although I wouldn't be surprised to find the odd other incorrect detail in this sort of area.

Failure to handle non-ISO-8859-1 URL-encoded form input Remember how I said that parsing HTTP form data was thorny? After we finished upgrading all our appservers to Python 3, people started reporting that they couldn't post Unicode comments to bugs, which turned out to be only if the attempt was made using JavaScript, and was because I hadn't quite managed to get URL-encoded form data working properly with zope.publisher and multipart. The current standard describes the URL-encoded format for form data as "in many ways an aberrant monstrosity", so this was no great surprise. Part of the problem was some very strange choices in zope.publisher dating back to 2004 or earlier, which I attempted to clean up and simplify. The rest was that Python 2's urlparse.parse_qs unconditionally decodes percent-encoded sequences as ISO-8859-1 if they're passed in as part of a Unicode string, so multipart needs to work around this on Python 2.
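
To make that quirk concrete: on Python 3 the decoding charset of urllib.parse.parse_qs is explicit (UTF-8 by default), whereas Python 2's urlparse.parse_qs effectively hard-coded ISO-8859-1 for Unicode input. A small Python 3 illustration (example data only, not Launchpad code):

from urllib.parse import parse_qs

# "%C3%A9" is the UTF-8 percent-encoding of "é".
query = "comment=caf%C3%A9"

print(parse_qs(query))                         # {'comment': ['café']}  -- UTF-8, the Python 3 default
print(parse_qs(query, encoding="iso-8859-1"))  # {'comment': ['cafÃ©']} -- what Python 2 effectively did
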
I'm still not completely confident that this workaround is correct in all situations, but at least now that we're on Python 3 everywhere the matrix of cases we need to care about is smaller.

Inconsistent marshalling of Loggerhead's disk cache We use Loggerhead for providing web browsing of Bazaar branches. When we upgraded one of its two servers to Python 3, we immediately noticed that the one still on Python 2 was failing to read back its revision information cache, which it stores in a database on disk. (We noticed this because it caused a deployment to fail: when we tried to roll out new code to the instance still on Python 2, Nagios checks had already caused an incompatible cache to be written for one branch from the Python 3 instance.) This turned out to be a similar problem to the pickle issue mentioned above, except this one was with marshal, which I didn't think to look for because it's a relatively obscure module mostly used for internal purposes by Python itself; I'm not sure that Loggerhead should really be using it in the first place. The fix was relatively straightforward, complicated mainly by now needing to cope with throwing away unreadable cache data. Ironically, if we'd just gone ahead and taken the nominally riskier path of upgrading both servers at the same time, we might never have had a problem here.

Intermittent bzr failures Finally, after we upgraded one of our two Bazaar codehosting servers to Python 3, we had a report of intermittent bzr branch hangs. After some digging I found this in our logs:
Traceback (most recent call last):
  ...
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/twisted/conch/ssh/channel.py", line 136, in addWindowBytes
    self.startWriting()
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/lazr/sshserver/session.py", line 88, in startWriting
    resumeProducing()
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/twisted/internet/process.py", line 894, in resumeProducing
    for p in self.pipes.itervalues():
builtins.AttributeError: 'dict' object has no attribute 'itervalues'
I'd seen this before in our git hosting service: it was a bug in Twisted's Python 3 port, fixed after 20.3.0 but unfortunately after the last release that supported Python 2, so we had to backport that patch. Using the same backport dealt with this. Onwards!

31 July 2021

Chris Lamb: Free software activities in July 2021

Here is my monthly update covering what I have been doing in the free software world during July 2021 (previous month): As part of my role of being the assistant Secretary of the Open Source Initiative and a board director of Software in the Public Interest, I attended their respective monthly meetings. As outlined in last month's post, however, my term on the OSI board has been slightly extended due to the discovery of a vulnerability in OSI's recent election; as a result, the 2021 election is currently being re-run.

Reproducible Builds One of the original promises of open source software is that distributed peer review and transparency of process results in enhanced end-user security. However, whilst anyone may inspect the source code of free and open source software for malicious flaws, almost all software today is distributed as pre-compiled binaries. This allows nefarious third parties to compromise systems by injecting malicious code into ostensibly secure software during the various compilation and distribution processes. The motivation behind the Reproducible Builds effort is to ensure no flaws have been introduced during this compilation process by promising identical results are always generated from a given source, thus allowing multiple third-parties to come to a consensus on whether a build was compromised. This month, I: I also made the following changes to diffoscope, including preparing and uploading versions 178 and 179 to PyPI and Debian: Debian Bugs filed Uploads

Debian LTS This month I have worked 18 hours on Debian Long Term Support (LTS) and 12 hours on its sister Extended LTS project. You can find out more about the project via the following video:

30 June 2021

Aigars Mahinovs: Keeping it as simple as possible

You know that you've had the same server too long when they discontinue the entire class of servers you were using and you need to migrate to a new instance. And you know you've not done anything with that server (and the blog running on it) for too long when you have no idea how that thing is actually working. It's a good opportunity to start over from scratch, and a good motivation to do the new thing as simply as humanly possible, or even simpler. So I am switching to a statically generated blog as well. Not sure what took me so long, but thank goodness the tooling has really improved since the last time I looked. It was as simple as picking Nikola, finding its import_feed plugin, changing the BLOG_RSS_LIMIT in my Django Mezzanine blog to a thousand (to export all posts via RSS/ATOM feed), fixing some bugs in the import_feed plugin, waiting a few minutes for the full feed to generate and to be imported, adjusting the config of the resulting site, pushing that to git and writing a simple shell script to pull that repo periodically and call nikola build on it, as well as config to serve the result via nginx. Done. After that, creating a new blog post is just nikola new_post and editing it in vim and pushing to git. I prefer Markdown, but it supports all kinds of formats. And the old posts are just stored as HTML. Really simple. I think I will spend more time fighting with Google to allow me to forward email from my domain to my GMail postbox without it refusing all of it as spam.

10 June 2021

Vincent Bernat: Serving WebP & AVIF images with Nginx

WebP and AVIF are two image formats for the web. They aim to produce smaller files than JPEG and PNG. They both support lossy and lossless compression, as well as alpha transparency. WebP was developed by Google and is a derivative of the VP8 video format.1 It is supported on most browsers. AVIF is using the newer AV1 video format to achieve better results. It is supported by Chromium-based browsers and has experimental support in Firefox.2


Without JavaScript, I can't tell what your browser supports.

Converting and optimizing images For this blog, I am using the following shell snippets to convert and optimize JPEG and PNG images. Skip to the next section if you are only interested in the Nginx setup.

JPEG images JPEG images are converted to WebP using cwebp.
find media/images -type f -name '*.jpg' -print0 \
  | xargs -0n1 -P$(nproc) -i \
      cwebp -q 84 -af '{}' -o '{}'.webp
They are converted to AVIF using avifenc from libavif:
find media/images -type f -name '*.jpg' -print0 \
  | xargs -0n1 -P$(nproc) -i \
      avifenc --codec aom --yuv 420 --min 20 --max 25 '{}' '{}'.avif
Then, they are optimized using jpegoptim built with Mozilla's improved JPEG encoder, via Nix. This is one reason I love Nix.
jpegoptim=$(nix-build --no-out-link \
      -E 'with (import <nixpkgs> {}); jpegoptim.override { libjpeg = mozjpeg; }')
find media/images -type f -name '*.jpg' -print0 \
  | sort -z \
  | xargs -0n10 -P$(nproc) \
      ${jpegoptim}/bin/jpegoptim --max=84 --all-progressive --strip-all

PNG images PNG images are down-sampled to 8-bit RGBA-palette using pngquant. The conversion reduces file sizes significantly while being mostly invisible.
find media/images -type f -name '*.png' -print0 \
  | sort -z \
  | xargs -0n10 -P$(nproc) \
      pngquant --skip-if-larger --strip \
               --quiet --ext .png --force
Then, they are converted to WebP with cwebp in lossless mode:
find media/images -type f -name '*.png' -print0 \
  | xargs -0n1 -P$(nproc) -i \
      cwebp -z 8 '{}' -o '{}'.webp
No conversion is done to AVIF: lossless compression is not as efficient as pngquant and lossy compression is only marginally better than what I get with WebP.

Keeping only the smallest files I am only keeping WebP and AVIF images if they are at least 10% smaller than the original format: decoding is usually faster for JPEG and PNG; and JPEG images can be decoded progressively.3
for f in media/images/**/*.{webp,avif}; do
  orig=$(stat --format %s ${f%.*})
  new=$(stat --format %s $f)
  (( orig*0.90 > new )) || rm $f
done
I only keep AVIF images if they are smaller than WebP.
for f in media/images/**/*.avif; do
  [[ -f ${f%.*}.webp ]] || continue
  orig=$(stat --format %s ${f%.*}.webp)
  new=$(stat --format %s $f)
  (( $orig > $new )) || rm $f
done
We can compare how many images are kept when converted to WebP or AVIF:
printf "     %10s %10s %10s\n" Original WebP AVIF
for format in png jpg; do
  printf " ${format:u}  %10s %10s %10s\n" \
    $(find media/images -name "*.$format" | wc -l) \
    $(find media/images -name "*.$format.webp" | wc -l) \
    $(find media/images -name "*.$format.avif" | wc -l)
done
AVIF is better than MozJPEG for most JPEG files while WebP beats MozJPEG only for one file out of two:
       Original       WebP       AVIF
 PNG         64         47          0
 JPG         83         40         74

Further reading I didn't detail my choices for quality parameters and there is not much science in it. Here are two resources providing more insight on AVIF:

Serving WebP & AVIF with Nginx To serve WebP and AVIF images, there are two possibilities:
  1. use <picture> to let the browser pick the format it supports, or
  2. use content negotiation to let the server send the best-supported format.
I use the second approach. It relies on inspecting the Accept HTTP header in the request. For Chrome, it looks like this:
Accept: image/avif,image/webp,image/apng,image/*,*/*;q=0.8
I configure Nginx to serve the AVIF image, then the WebP image, and fall back to the original JPEG/PNG image depending on what the browser advertises:4
http {
  map $http_accept $webp_suffix {
    default        "";
    "~image/webp"  ".webp";
  }
  map $http_accept $avif_suffix {
    default        "";
    "~image/avif"  ".avif";
  }
}
server {
  # [ ]
  location ~ ^/images/.*\.(png|jpe?g)$ {
    add_header Vary Accept;
    try_files $uri$avif_suffix$webp_suffix $uri$avif_suffix $uri$webp_suffix $uri =404;
  }
}
For example, let's suppose the browser requests /images/ont-box-orange@2x.jpg. If it supports WebP but not AVIF, $webp_suffix is set to .webp while $avif_suffix is set to the empty string. The server tries to serve the first existing file in this list:
  • /images/ont-box-orange@2x.jpg.webp
  • /images/ont-box-orange@2x.jpg
  • /images/ont-box-orange@2x.jpg.webp
  • /images/ont-box-orange@2x.jpg
If the browser supports both AVIF and WebP, Nginx walks the following list:
  • /images/ont-box-orange@2x.jpg.avif.webp (it never exists)
  • /images/ont-box-orange@2x.jpg.avif
  • /images/ont-box-orange@2x.jpg.webp
  • /images/ont-box-orange@2x.jpg
Eugene Lazutkin explains in more detail how this works. I have only presented a variation of his setup supporting both WebP and AVIF.

  1. VP8 is only used for lossy compression. Lossless compression is using an unrelated format.
  2. Firefox support was scheduled for Firefox 86 but because of the lack of proper color space support, it is still not enabled by default.
  3. Progressive decoding is not planned for WebP but could be implemented using low-quality thumbnail images for AVIF. See this issue for a discussion.
  4. The Vary header ensures an intermediary cache (a proxy or a CDN) checks the Accept header before using a cached response. Internet Explorer has trouble with this header and may not be able to cache the resource properly. There is a workaround but Internet Explorer's market share is now so small that it is pointless to implement it.

30 May 2021

Russell Coker: HP ML110 Gen9

I've just bought an HP ML110 Gen9 as a personal workstation, here are my notes about it and documentation on running Debian on it. Why a Server? I bought this because the ML350p Gen8 turned out to be too noisy for my taste [1]. I've just been editing my page about Memtest86+ RAM speeds [2]; over the course of 10 years (high end laptop in 2001 to low end server in 2011) RAM speed increased by a factor of 100. RAM speed has been increasing at a lower rate than CPU speed and is becoming an increasing bottleneck on system performance. So while I could get a faster white-box system, the cost of a second-hand server isn't that great and I'm getting a system that's 100 times faster than what was adequate for most tasks in 2001. HP makes some nice workstation class machines with ECC RAM (think server without remote management, hot-swap disks, or redundant PSU but with sound hardware). But they are significantly more expensive on the second hand market than servers. This server cost me $650 and came with 2*480G DC grade SSDs (Intel but with HPE stickers). I hope that more than half of the purchase price will be recovered from selling the SSDs (I will use NVMe). Also 64G of non-ECC RAM costs $370 from my local store. As I want lots of RAM for testing software on VMs, it will probably turn out that the server cost me less than the cost of new RAM once I've sold the SSDs! Monitoring
wget -O /usr/local/hpePublicKey2048_key1.pub https://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub
echo "# HP monitoring" >> /etc/apt/sources.list
echo "deb [signed-by=/usr/local/hpePublicKey2048_key1.pub] http://downloads.linux.hpe.com/SDR/downloads/MCP/Debian/ stretch/current-gen9 non-free" >> /etc/apt/sources.list
The above commands will make the management utilities installable on Debian/Buster. If using Bullseye (Testing at the moment) then you need to have Buster repositories in APT for dependencies; HP doesn't seem to have packaged all their utilities for Buster.
wget -r -np -A Contents-amd64.bz2 http://downloads.linux.hpe.com/SDR/repo/mcp/debian/dists
To find out which repositories had the programs I need, I ran the above recursive wget and then uncompressed them for grep -R (as an aside it would be nice if bzgrep supported -R). I installed the hp-health package which has hpasmcli for viewing and setting many configuration options and hplog for viewing event log data and thermal data (among a few other things). I've added a new monitor to etbemon, hp-temp.monitor, to monitor HP server temperatures; I haven't made a configuration option to change the thresholds for what is considered normal because I don't expect server class systems to be routinely running above the warning temperature. For the linux-temp.monitor script I added a command-line option for the percentage of the high temperature that is an error condition as well as an option for the number of CPU cores that need to be over-temperature; having one core permanently over the high temperature due to a web browser seems standard for white-box workstations nowadays. The hp-health package depends on libc6-i686 lib32gcc1 even though none of the programs it contains use lib32gcc1. Depending on lib32gcc1 instead of lib32gcc1 | lib32gcc-s1 means that installing hp-health requires removing mesa-opencl-icd which probably means that BOINC can't use the GPU among other things. I solved this by editing /var/lib/dpkg/status and changing the package dependencies to what I desired. Note that this is not something for a novice to do, make a backup and make sure you know what you are doing! Issues The HPE Dynamic Smart Array B140i is a software RAID device. While it's convenient for some users that software RAID gets supported in the UEFI boot process, generally software RAID is a bad idea. Also my system has hot-swap drive caddies but the controller doesn't support hot-swap. So the first thing to do was to configure the array controller to run in AHCI mode and give up on using hot-swap drive caddies for hot-swap. I tested all the documented ways of scanning for new devices and nothing other than a reboot made the kernel recognise a new SATA disk. According to specs provided by Dell and HP the ML110 Gen9 makes less noise than the PowerEdge T320; according to my own observations the reverse is the case. I don't know if this is because of Dell being more conservative in their specs than HP or because of how dBA is measured vs my own personal annoyance thresholds for sounds. As the system makes more noise than I'm comfortable with, I plan to build a rubber enclosure for the rear of the system to reduce noise; that will be the subject of another post. For Australian readers, Bunnings has some good deals on rubber floor mats that can be used to reduce server noise. The server doesn't have sound hardware; while one could argue that servers don't need sound, there are some server uses for sound hardware such as using line input as a source of entropy. Also for a manufacturer it might be a benefit to use the same motherboard for workstations and servers. Fortunately a friend gave me a nice set of Logitech USB speakers a few years ago that I hadn't previously had a cause to use, so that will solve the problem for me (I don't need line-in on a workstation). UEFI and Memtest I decided to try UEFI boot for something new (in the past I'd only used UEFI boot for a server that only had large disks). In the past I've booted all my own systems with BIOS boot because I'm familiar with it and they all have SSDs for booting which are less than 2TB in size (until recently 2TB SSDs weren't affordable for my personal use).
The Debian UEFI wiki page is worth reading [3]. The Debian Wiki page about ProLiant servers [4] is worth reading too. Memtest86+ doesn't support EFI booting (it just goes to a black screen) even though Debian/Buster puts in a GRUB entry for it (Debian bug #695246 was filed for this in 2012). Also on my ML110 Memtest86+ doesn't report the RAM speed (a known issue on Memtest86+). Comments on the net say that Memtest86+ hasn't been maintained for a long time and Memtest86 (the non-free version) has been updated more recently. So far I haven't seen a system with ECC RAM have a memory problem that could be detected by Memtest86+; the memory problems I've seen on ECC systems have been things that prevent booting (RAM not being recognised correctly), that are detected by the BIOS as ECC errors before booting, or that are reported by the kernel as ECC errors at run time (happened years ago and I can't remember the details). Overall I'm not a fan of EFI with the way it currently works in Debian. It seems to add some of the GRUB functionality into the BIOS and then use that to load GRUB. It seems that EFI can do everything you need and it would be better to just have a single boot loader, not two of them chained. Power Supply There are a range of PSUs for the ML110; the one I have has the smallest available PSU (350W) and doesn't have a PCIe power cable (the one used for video cards). Here is the HP document which shows the cabling for the various ML110 Gen8 PSUs [5]; I have the 350W PSU. One thing I've considered is whether I could make an adaptor from the drive bay power to the PCIe connector. A quick web search indicates that 4 SAS disks when active can take up to 75W more power than a system with no disks. If that's the case then the 2 spare drive bay connectors which can each handle 4 disks should be able to supply 150W. As a 6 pin PCIe power cable (GPU power cable) is rated at 75W that should be fine in theory (here's a page with the pinouts for PCIe power connectors [6]). My video card is a Radeon R7 260X which apparently takes about 113W all up, so it should be taking less than 75W from the PCIe power cable (the PCIe slot itself supplies up to 75W of that). All I really want is YouTube, Netflix, and text editing at 4K resolution. So I don't need much in terms of 3D power. KDE uses some of the advanced features of modern video cards, but it doesn't compare to 3D gaming. According to the Wikipedia page for the Radeon RX 500 series [7], the RX560 supports DisplayPort 1.4 and HDMI 2.0 (both of which do 4K@60Hz) and has a TDP of 75W. So a RX560 video card seems like a good option that will work in any system that doesn't have a spare PCIe power cable. I've just ordered one of those for $246 so hopefully that will arrive in a week or so. PCI Fan The ML110 Gen9 has an optional PCIe fan and baffle to cool PCIe cards (part number 784580-B21). Extra cooling of PCIe cards is a good thing, but $400 list price (and about $50 ebay price) for the fan and baffle is unpleasant. When I boot the system with a PCIe dual-ethernet card and two PCIe NVMe cards it gives a BIOS warning on boot; when I add a video card it refuses to boot without the extra fan. It's nice that the system makes sure it doesn't get into a thermal overload situation, but it would be nicer if they just shipped all necessary fans with it instead of trying to get more money out of customers. I just bought a PCI fan and baffle kit for $60.
Conclusion In spite of the unexpected expense of a new video card and PCI fan, the overall cost of this system is still low, particularly when considering that I'll find another use for the video card which needs an extra power connector. It is disappointing that HP didn't supply a more capable PSU and fit all the fans to all models; the expectation of a server is that you can just do server stuff, not have to buy extra bits before you can do server stuff. If you want to install Tesla GPUs or something then it's expected that you might need to do something unusual with a server, but the basic stuff should just work. A single processor tower server should be designed to function as a deskside workstation and be able to handle an average video card. Generally it's a nice computer; I look forward to getting the next deliveries of parts so I can make it work properly.

27 May 2021

Wouter Verhelst: SReview and pandemics

The pandemic was a bit of a mess for most FLOSS conferences. The two conferences that I help organize -- FOSDEM and DebConf -- are no exception. In both conferences, I do essentially the same work: as a member of both video teams, I manage the postprocessing of the video recordings of all the talks that happened at the respective conference(s). I do this by way of SReview, the online video review and transcode system that I wrote, which essentially crowdsources the manual work that needs to be done, and automates as much as possible of the workflow.

The original version of SReview consisted of a database, a (very basic) Mojolicious-based web interface, and a bunch of perl scripts which would build and execute ffmpeg command lines using string interpolation. As a quick hack that I needed to get working while writing it in my spare time in half a year, that approach was workable and resulted in successful postprocessing after FOSDEM 2017, and a significant improvement in time from the previous years. However, I did not end development with that, and since then I've replaced the string interpolation by an object oriented API for generating ffmpeg command lines, as well as modularized the web interface. Additionally, I've had help reworking the user interface into a system that is somewhat easier to use than my original interface, and have slowly but surely added more features to the system so as to make it more flexible, as well as support more types of environments for the system to run in.

One of the major issues that still remains with SReview is that the administrator's interface is pretty terrible. I had been planning on revamping that for 2020, but then massive amounts of people got sick, travel was banned, and both the conferences that I work on were converted to online-only conferences. These have some very specific requirements; e.g., both conferences allowed people to upload a prerecorded version of their talk, rather than doing the talk live. Since preprocessing a video is, technically, very similar to postprocessing it, I adapted SReview to allow people to upload a video file that it would then validate (in terms of length, codec, and apparent resolution). This seems easy to do, but I decided to implement this functionality so that it would also allow future use for in-person conferences, where occasionally a speaker requests that modifications be made to the video file in a way that SReview is unable to do. This made it marginally more involved, but at least it means that a feature which I had planned to implement some years down the line is now already implemented. The new feature works quite well, and I'm happy I've implemented it in the way that I have.

In order for the "upload" processing and the "post-event" processing not to be confused, however, I decided to import the conference schedules twice: once as the conference itself, and once as a shadow version of that conference for the prerecordings. That way, I could track the progress through the system of the prerecordings completely separately from the progress of the postprocessing of the video (which adds opening/closing credits, and transcodes to multiple variants of the same video). Schedule parsing was something that had not been implemented in a generic way yet, however; since that made doubling the schedule in that way rather complex, I decided to bite the bullet and (finally) implement schedule parsing in a generic way.
Currently, schedule parsers exist for two formats (Pentabarf XML and the Wafer variant of that same format, which is almost, but not quite, entirely the same). The API for that is quite flexible, and I'm happy with the way things have been implemented there. I've also implemented a set of "virtual" parsers, which allow mangling the schedule in various ways (by either filtering out talks that we don't want, or by generating the shadow version of the schedule that I talked about earlier).

While the SReview settings have reasonable defaults, occasionally the output of SReview is not entirely acceptable, due to more complicated matters that then result in encoding artifacts. As a result, the DebConf video team has been doing a final review step, completely outside of SReview, to ensure that such encoding artifacts don't exist. That seemed suboptimal, so recently I've been working on integrating that into SReview as well. First tests have been run and seem to be acceptable, but there are still a few loose ends to be finalized. As part of this, I've also reworked the way comments can be entered into the system. Previously the presence of a comment would signal that the video has some problems that an administrator needed to look at. Unfortunately, that was causing some confusion, with some people even thinking it's a good place to enter a "thank you for your work" style of comment... which it obviously isn't. Turning it into a "comment log" system instead fixes that, and also allows for better two-way communication between administrators and reviewers. Hopefully that'll improve things in that area as well.

Finally, the audio normalization in SReview -- for which I've long used bs1770gain -- is having problems. First of all, bs1770gain will sometimes alter the timing of the video or audio file that it's passed, which is very problematic if I want to process it further. There is an ffmpeg loudnorm filter which implements the same algorithm, so that should make things easier to use. Secondly, the author of bs1770gain is a strange character that I'd rather not be involved with. Before I knew about the loudnorm filter I didn't really have a choice, but now I can just rip bs1770gain out and replace it by the loudnorm filter. That will fix various other bugs in SReview, too, because SReview relies on behaviour that isn't actually there (but which I didn't know about at the time when I wrote it).

All in all, the past year-and-a-bit has seen a lot of development for SReview, with multiple features being added and a number of long-standing problems being fixed. Now if only the pandemic would subside, allowing the whole "let's do everything online only" wave to cool down a bit, so that I can finally make time to implement the admin interface...
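As an aside on the loudnorm switch mentioned above: swapping bs1770gain for ffmpeg's loudnorm filter boils down to a single filter invocation. A minimal sketch follows; the I/LRA/TP targets are illustrative EBU R128-style values, assumptions rather than SReview's actual settings.

# Normalize audio to roughly -23 LUFS while copying the video stream unchanged;
# the I/LRA/TP values here are placeholders, not SReview's configuration.
ffmpeg -i input.mkv -c:v copy -af loudnorm=I=-23:LRA=7:TP=-2 -c:a aac output.mkv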

13 May 2021

Shirish Agarwal: Population, Immigration, Vaccines and Mass-Surveillance.

The Population Issue and its many facets Another couple of weeks passed. A Lot of things happening, lots of anger and depression in folks due to handling in pandemic, but instead of blaming they are willing to blame everybody else including the population. Many of them want forced sterilization like what Sanjay Gandhi did during the Emergency (1975). I had to share So Long, My son . A very moving tale of two families of what happened to them during the one-child policy in China. I was so moved by it and couldn t believe that the Chinese censors allowed it to be produced, shot, edited, and then shared worldwide. It also won a couple of awards at the 69th Berlin Film Festival, silver bear for the best actor and the actress in that category. But more than the award, the theme, and the concept as well as the length of the movie which was astonishing. Over a 3 hr. something it paints a moving picture of love, loss, shame, relief, anger, and asking for forgiveness. All of which can be identified by any rational person with feelings worldwide.

Girl child What was also interesting though was what it couldn t or wasn t able to talk about and that is the Chinese leftover men. In fact, a similar situation exists here in India, only it has been suppressed. This has been more pronounced more in Asia than in other places. One big thing in this is human trafficking and mostly women trafficking. For the Chinese male, that was happening on a large scale from all neighboring countries including India. This has been shared in media and everybody knows about it and yet people are silent. But this is not limited to just the Chinese, even Indians have been doing it. Even yesteryear actress Rupa Ganguly was caught red-handed but then later let off after formal questioning as she is from the ruling party. So much for justice. What is and has been surprising at least for me is Rwanda which is in the top 10 of some of the best places in equal gender. It, along with other African countries have also been in news for putting quite a significant amount of percentage of GDP into public healthcare (between 20-10%), but that is a story for a bit later. People forget or want to forget that it was in Satara, a city in my own state where 220 girls changed their name from nakusha or unwanted to something else and that had become a piece of global news. One would think that after so many years, things would have changed, the only change that has happened is that now we have two ministries, The Ministry of Women and Child Development (MoWCD) and The Ministry of Health and Welfare (MoHFW). Sadly, in both cases, the ministries have been found wanting, Whether it was the high-profile Hathras case or even the routine cries of help which given by women on the twitter helpline. Sadly, neither of these ministries talks about POSH guidelines which came up after the 2012 gangrape case. For both these ministries, it should have been a pinned tweet. There is also the 1994 PCPNDT Act which although made in 1994, actually functioned in 2006, although what happens underground even today nobody knows  . On the global stage, about a decade ago, Stephen J. Dubner and Steven Levitt argued in their book Freakonomics how legalized abortion both made the coming population explosion as well as expected crime rates to be reduced. There was a huge pushback on the same from the conservatives and has become a matter of debate, perhaps something that the Conservatives wanted. Interestingly, it hasn t made them go back but go forward as can be seen from the Freakonomics site.

Climate Change Another topic that came up for discussion was repeatedly climate change, but when I share Shell s own 1998 Confidential report titled Greenhouse effect all become strangely silent. The silence here is of two parts, there probably is a large swathe of Indians who haven t read the report and there may be a minority who have read it and know what already has been shared with U.S. Congress. The Conservative s argument has been for it is jobs and a weak we need to research more . There was a partial debunk of it on the TBD podcast by Matt Farell and his brother Sean Farell as to how quickly the energy companies are taking to the coming change.

Health Budget Before going to Covid stories. I first wanted to talk about Health Budgets. From the last 7 years the Center s allocation for health has been between 0.34 to 0.8% per year. That amount barely covers the salaries to the staff, let alone any money for equipment or anything else. And here by allocation I mean, what is actually spent, not the one that is shared by GOI as part of budget proposal. In fact, an article on Wire gives a good breakdown of the numbers. Even those who are on the path of free markets describe India s health business model as a flawed one. See the Bloomberg Quint story on that. Now let me come to Rwanda. Why did I chose Rwanda, I could have chosen South Africa where I went for Debconf 2016, I chose because Rwanda s story is that much more inspiring. In many ways much more inspiring than that South Africa in many ways. Here is a country which for decades had one war or the other, culminating into the Rwanda Civil War which ended in 1994. And coincidentally, they gained independence on a similar timeline as South Africa ending Apartheid in 1994. What does the country do, when it gains its independence, it first puts most of its resources in the healthcare sector. The first few years at 20% of GDP, later than at 10% of GDP till everybody has universal medical coverage. Coming back to the Bloomberg article I shared, the story does not go into the depth of beyond-expiry date medicines, spurious medicines and whatnot. Sadly, most media in India does not cover the deaths happening in rural areas and this I am talking about normal times. Today what is happening in rural areas is just pure madness. For last couple of days have been talking with people who are and have been covering rural areas. In many of those communities, there is vaccine hesitancy and why, because there have been whatsapp forwards sharing that if you go to a hospital you will die and your kidney or some other part of the body will be taken by the doctor. This does two things, it scares people into not going and getting vaccinated, at the same time they are prejudiced against science. This is politics of the lowest kind. And they do it so that they will be forced to go to temples or babas and what not and ask for solutions. And whether they work or not is immaterial, they get fixed and property and money is seized. Sadly, there are not many Indian movies of North which have tried to show it except for oh my god but even here it doesn t go the distance. A much more honest approach was done in Trance . I have never understood how the South Indian movies are able to do a more honest job of story-telling than what is done in Bollywood even though they do in 1/10th the budget that is needed in Bollywood. Although, have to say with OTT, some baggage has been shed but with the whole film certification rearing its ugly head through MEITY orders, it seems two steps backward instead of forward. The idea being simply to infantilize the citizens even more. That is a whole different ball-game which probably will require its own space.

Vaccine issues One good news though is that Vaccination has started. But it has been a long story full of greed by none other than GOI (Government of India) or the ruling party BJP. Where should I start with. I probably should start with this excellent article done by Priyanka Pulla. It is interesting and fascinating to know how vaccines are made, at least one way which she shared. She also shared about the Cutter Incident which happened in the late 50 s. The response was on expected lines, character assassination of her and the newspaper they published but could not critique any of the points made by her. Not a single point that she didn t think about x or y. Interestingly enough, in January 2021 Bharati Biotech was supposed to be share phase 3 trial data but hasn t been put up in public domain till May 2021. In fact, there have been a few threads raised by both well-meaning Indians as well as others globally especially on twitter to which GOI/ICMR (Indian Council of Medical Research) is silent. Another interesting point to note is that Russia did say in its press release that it is possible that their vaccine may not be standard (read inactivation on their vaccines and another way is possible but would take time, again Brazil has objected, but India hasn t till date.) What also has been interesting is the homegrown B.1.617 lineage or known as double mutant . This was first discovered from my own state, Maharashtra and then transported around the world. There is also B.1.618 which was found in West Bengal and is same or supposed to be similar to the one found in South Africa. This one is known as Triple mutant . About B.1.618 we don t know much other than knowing that it is much more easily transferable, much more infectious. Most countries have banned flights from India and I cannot fault them anyway. Hell, when even our diplomats do not care for procedures to be followed during the pandemic then how a common man is supposed to do. Of course, now for next month, Mr. Modi was supposed to go and now will not attend the G7 meeting. Whether, it is because he would have to face the press (the only Prime Minister and the only Indian Prime Minister who never has faced free press.) or because the Indian delegation has been disinvited, we would never know.

A good article which shares lots of lows with how things have been done in India has been an article by Arundhati Roy. And while the article in itself is excellent and shares a bit of the bitter truth but is still incomplete as so much has been happening. The problem is that the issue manifests in so many ways, it is difficult to hold on. As Arundhati shared, should we just look at figures and numbers and hold on, or should we look at individual ones, for e.g. the one shared in Outlook India. Or the one shared by Dr. Dipshika Ghosh who works in Covid ICU in some hospital
Dr. Dipika Ghosh sharing an incident in Covid Ward

Interestingly as well, while in the vaccine issue, Brazil Anvisa doesn t know what they are doing or the regulator just isn t knowledgeable etc. (statements by various people in GOI, when it comes to testing kits, the same is an approver.)

ICMR/DGCI approving internationally validated kits, Press release.

Twitter In the midst of all this, one thing that many people have forgotten and seem to have forgotten that Twitter and other tools are used by only the elite. The reason why the whole thing has become serious now than in the first phase is because the elite of India have also fallen sick and dying which was not the case so much in the first phase. The population on Twitter is estimated to be around 30-34 million and people who are everyday around 20 odd million or so, which is what 2% of the Indian population which is estimated to be around 1.34 billion. The other 98% don t even know that there is something like twitter on which you can ask help. Twitter itself is exclusionary in many ways, with both the emoticons, the language and all sorts of things. There is a small subset who does use Twitter in regional languages, but they are too small to write anything about. The main language is English which does become a hindrance to lot of people.

Censorship Censorship of Indians critical of Govt. mishandling has been non-stop. Even U.S. which usually doesn t interfere into India s internal politics was forced to make an exception. But of course, this has been on deaf ears. There is and was a good thread on Twitter by Gaurav Sabnis, a friend, fellow Puneite now settled in U.S. as a professor.
Gaurav on Trump-Biden on vaccination of their own citizens
Now just to summarise what has been happening in India and what has been happening in most of the countries around the world. Most of the countries have done centralized purchasing of the vaccine, which is then distributed by the States; this is what we understand as co-operative federalism. While last year, GOI took a lot of money under the shady PM Cares fund for vaccine purchase, donations from well-meaning Indians as well as Industries and trade bodies. Then later, GOI said it would leave the states hanging and it is they who would have to buy vaccines from the manufacturers. This is again cheap politics. The idea behind it is simple: GOI knows that almost all the states are strapped for cash. This is not new news; this I have shared a couple of months back. The problem has been that for the last 6-8 months no GST meeting has taken place, as shared by Punjab's Finance Minister Amarinder Singh. What will happen is that all the states will fight among themselves for the vaccine, and most of them are now non-BJP Governments. The idea is to let the states fight and somehow be on top. So the pandemic, instead of being a public health issue, has become something on which politics has to be played. The news on whatsapp by RW media is that it's ok even if a million or two also die, as it is India is heavily populated. Although that argument vanishes for those who lose their dear and near ones. But that just isn't the issue; the issue goes much deeper than that. The GST rates levied are:

Oxygen: 12%
Remdesivir: 12%
Sanitiser: 12%
Ventilator: 12%
PPE: 18%
Ambulances: 28%

Now all the products above are essential medical equipment, should be declared as such, and should have price controls on which GST is levied. In times of a pandemic, should the center be profiting on those? States want to let go, and even want the center to let go, so that some relief is there for the public, while at the same time declaring them essential medical equipment with price controls. But GOI doesn't want to. Leaders of opposition parties wrote open letters, but to no effect. What is sad to me is how Ambulances are being taxed at 28%. Are they luxury items or sin goods? This also reminds me of the recent discovery shared by Mr. Pappu Yadav in Bihar. You can see the color of ambulances as shared by Mr. Yadav, and the same news being shared by India TV news showing other ambulances. Also, the weak argument being made of not having enough drivers. Ideally, you should have 2-3 people; both 9-1-1 and Chicago Fire show 2 people in an ambulance, but a few times they have also been shown to be flipped over. Europeans seem to have three people in an ambulance, and they are also much more disciplined as drivers, at least in the opinion shared by an American expat.
Pappu Yadav, President Jan Adhikar Party, Bihar May 11, 2021
What is also interesting to note is GOI plays this game of Health is State subject and health is Central subject depending on its convenience. Last year, when it invoked the Epidemic and DMA Act it was a Central subject, now when bodies are flowing down the Ganges and pyres being lit everywhere, it becomes a State subject. But when and where money is involved, it again becomes a Central subject. The States are also understanding it, but they are fighting on too many fronts.
Snippets from Karnataka High Court hearing today, 13th March 2021
One of the good things is most of the High Courts have woken up. Many of the people on the RW think that the Courts are doing Judicial activism . And while there may be an iota of truth in it, the bitter truth is that many judges or relatives or their helpers have diagnosed and some have even died due to Covid. In face of the inevitable, what can they do. They are hauling up local Governments to make sure they are accountable while at the same time making sure that they get access to medical facilities. And I as a citizen don t see any wrong in that even if they are doing it for selfish reasons. Because, even if justice is being done for selfish reasons, if it does improve medical delivery systems for the masses, it is cool. If it means that the poor and everybody else are able to get vaccinations, oxygen and whatever they need, it is cool. Of course, we are still seeing reports of patients spending in the region of INR 50k and more for each day spent in hospital. But as there are no price controls, judges cannot do anything unless they want to make an enemy of the medical lobby in the country. A good story on medicines and what happens in rural areas, see no further than Laakhon mein ek.
Allahabad High Court hauling Uttar Pradesh Govt. for lack of Oxygen is equal to genocide, May 11, 2021
The censorship is not just related to takedown requests on twitter but nowadays also any articles which are critical of the GOI s handling. I have been seeing many articles which have shared facts and have been critical of GOI being taken down. Previously, we used to see 404 errors happen 7-10 years down the line and that was reasonable. Now we see that happen, days weeks or months. India seems to be turning more into China and North Korea and become more anti-science day-by-day

Fake websites Before going into fake websites, let me start with a fake newspaper which was started by none other than the Gujarat CM Mr. Modi in 2005 .
Gujarat Satya Samachar 2005 launched by Mr. Modi.
And if this wasn t enough than on Feb 8, 2005, he had invoked Official Secrets Act
Mr. Modi invoking Official Secrets Act, Feb 8 2005 Gujarat Samachar
The headlines were In Modi s regime press freedom is in peril-Down with Modi s dictatorship. So this was a tried and tested technique. The above information was shared by Mr. Urvish Kothari, who incidentally also has his own youtube channel. Now cut to 2021, and we have a slew of fake websites being done by the same party. In fact, it seems they started this right from 2011. A good article on BBC itself tells the story. Hell, Disinfo.eu which basically combats disinformation in EU has a whole pdf chronicling how BJP has been doing it. Some of the sites it shared are

Times of New York
Manchester Times
Times of Los Angeles
Manhattan Post
Washington Herald
and many more. The idea being take any site name which sounds similar to a brand name recognized by Indians and make fool of them. Of course, those of who use whois and other such tools can easily know what is happening. Two more were added to the list yesterday, Daily Guardian and Australia Today. There are of course, many features which tell them apart from genuine websites. Most of these are on shared hosting rather than dedicated hosting, most of these are bought either from Godaddy and Bluehost. While Bluehost used to be a class act once upon a time, both the above will do anything as long as they get money. Don t care whether it s a fake website or true. Capitalism at its finest or worst depending upon how you look at it. But most of these details are lost on people who do not know web servers, at all and instead think see it is from an exotic site, a foreign site and it chooses to have same ideas as me. Those who are corrupt or see politics as a tool to win at any cost will not see it as evil. And as a gentleman Raghav shared with me, it is so easy to fool us. An example he shared which I had forgotten. Peter England which used to be an Irish brand was bought by Aditya Birla group way back in 2000. But even today, when you go for Peter England, the way the packaging is done, the way the prices are, more often than not, people believe they are buying the Irish brand. While sharing this, there is so much of Naom Chomsky which comes to my mind again and again

Caste Issues I had written about caste issues a few times on this blog. This again came to the fore as news came that a Hindu sect used forced labor from Dalit community to make a temple. This was also shared by the hill. In both, Mr. Joshi doesn t tell that if they were volunteers then why their passports have been taken forcibly, also I looked at both minimum wage prevailing in New Jersey as a state as well as wage given to those who are in the construction Industry. Even in minimum wage, they were giving $1 when the prevailing minimum wage for unskilled work is $12.00 and as Mr. Joshi shared that they are specialized artisans, then they should be paid between $23 $30 per hour. If this isn t exploitation, then I don t know what is. And this is not the first instance, the first instance was perhaps the case against Cisco which was done by John Doe. While I had been busy with other things, it seems Cisco had put up both a demurrer petition and a petition to strike which the Court stayed. This seemed to all over again a type of apartheid practice, only this time applied to caste. The good thing is that the court stayed the petition. Dr. Ambedkar s statement if Hindus migrate to other regions on earth, Indian caste would become a world problem given at Columbia University in 1916, seems to be proven right in today s time and sadly has aged well. But this is not just something which is there only in U.S. this is there in India even today, just couple of days back, a popular actress Munmun Dutta used a casteist slur and then later apologized giving the excuse that she didn t know Hindi. And this is patently false as she has been in the Bollywood industry for almost now 16-17 years. This again, was not an isolated incident. Seema Singh, a lecturer in IIT-Kharagpur abused students from SC, ST backgrounds and was later suspended. There is an SC/ST Atrocities Act but that has been diluted by this Govt. A bit on the background of Dr. Ambedkar can be found at a blog on Columbia website. As I have shared and asked before, how do we think, for what reason the Age of Englightenment or the Age of Reason happened. If I were a fat monk or a priest who was privileges, would I have let Age of Enlightenment happen. It broke religion or rather Church which was most powerful to not so powerful and that power was more distributed among all sort of thinkers, philosophers, tinkers, inventors and so on and so forth.

Situation going forward I believe things are going to be far more complex and deadly before they get better. I had to share another term called Comorbidities which fortunately or unfortunately has also become part of twitter lexicon. While I have shared what it means, it simply means when you have an existing ailment or condition and then Coronavirus attacks you. The Virus will weaken you. The Vaccine in the best case just stops the damage, but the damage already done can t be reversed. There are people who advise and people who are taking steroids but that again has its own side-effects. And this is now, when we are in summer. I am afraid for those who have recovered, what will happen to them during the Monsoons. We know that the Virus attacks most the lungs and their quality of life will be affected. Even the immune system may have issues. We also know about the inflammation. And the grant that has been given to University of Dundee also has signs of worry, both for people like me (obese) as well as those who have heart issues already. In other news, my city which has been under partial lockdown since a month, has been extended for another couple of weeks. There are rumors that the same may continue till the year-end even if it means economics goes out of the window.There is possibility that in the next few months something like 2 million odd Indians could die
The above is a conversation between Karan Thapar and an Oxford Mathematician Dr. Murad Banaji who has shared that the under-counting of cases in India is huge. Even BBC shared an article on the scope of under-counting. Of course, those on the RW call of the evidence including the deaths and obituaries in newspapers as a narrative . And when asked that when deaths used to be in the 20 s or 30 s which has jumped to 200-300 deaths and this is just the middle class and above. The poor don t have the money to get wood and that is the reason you are seeing the bodies in Ganges whether in Buxar Bihar or Gajipur, Uttar Pradesh. The sights and visuals makes for sorry reading
Pandit Ranjan Mishra son on his father s death due to unavailability of oxygen, Varanasi, Uttar Pradesh, 11th May 2021.
For those who don t know Pandit Ranjan Mishra was a renowned classical singer. More importantly, he was the first person to suggest Mr. Modi s name as a Prime Ministerial Candidate. If they couldn t fulfil his oxygen needs, then what can be expected for the normal public.

Conclusion Sadly, this time I have no humorous piece to share, I can however share a documentary which was shared on Feluda . I have shared about Feluda or Prodosh Chandra Mitter a few times on this blog. He has been the answer of James Bond from India. I have shared previously about The Golden Fortress . An amazing piece of art by Satyajit Ray. I watched that documentary two-three times. I thought, mistakenly that I am the only fool or fan of Feluda in Pune to find out that there are people who are even more than me. There were so many facets both about Feluda and master craftsman Satyajit Ray that I was unaware about. I was just simply amazed. I even shared few of the tidbits with mum as well, although now she has been truly hooked to Korean dramas. The only solace from all the surrounding madness. So, if you have nothing to do, you can look up his books, read them and then see the movies. And my first recommendation would be the Golden Fortress. The only thing I would say, do not have high hopes. The movie is beautiful. It starts slow and then picks up speed, just like a train. So, till later. Update The Mass surveillance part I could not do justice do hence removed it at the last moment. It actually needs its whole space, article. There is so much that the Govt. is doing under the guise of the pandemic that it is difficult to share it all in one article. As it is, the article is big

19 April 2021

Ritesh Raj Sarraf: Catching Up Your Sources

I ve mostly had the preference of controlling my data rather than depend on someone else. That s one reason why I still believe email to be my most reliable medium for data storage, one that is not plagued/locked by a single entity. If I had the resources, I d prefer all digital data to be broken down to its simplest form for storage, like email format, and empower the user with it i.e. their data. Yes, there are free services that are indirectly forced upon common users, and many of us get attracted to it. Many of us do not think that the information, which is shared for the free service in return, is of much importance. Which may be fair, depending on the individual, given that they get certain services without paying any direct dime.

New age communication So first, we had email and usenet. As I mentioned above, email was designed with fine intentions. Intentions that make it stand even today, independently. But not everything, back then, was that great either. For example, instant messaging was very closed and centralised then too. Things like ICQ, MSN, Yahoo Messenger; all were centralized. I wonder if people still have access to their ICQ logs. Not much has changed in the current day either. We now have domination by Facebook Messenger, Google (whatever the new marketing term they introduce), WhatsApp, Telegram, Signal etc. To my knowledge, they are all centralized. Over all this time, I'm yet to see a product come up with good (business) intentions, to really empower the end user. In this information age, the most invaluable data is user activity. That's the one piece of data everyone is after. If you decline to share that bit of free data in exchange for the free services, mind you, then those free services like Facebook, Google, Instagram, WhatsApp, Truecaller, Twitter; none of it would come to you at all. Try it out. So the reality is that while you may not be valuing the data you offer in exchange correctly, there's a lot that is reaped from it. But still, I think each user has (and should have) the freedom to opt in for these tech giants and give them their personal bit, for free services in return. That is a decent barter deal. And it is a choice that one is free to make.

Retaining my data I m fond of keeping an archive folder in my mailbox. A folder that holds significant events in the form of an email usually, if documented. Over the years, I chose to resort to the email format because I felt it was more reliable in the longer term than any other formats. The next best would be plain text. In my lifetime, I have learnt a lot from the internet; so it is natural that my preference has been with it. Mailing Lists, IRCs, HOWTOs, Guides, Blog posts; all have helped. And over the years, I ve come across hundreds of such content that I d always like to preserve. Now there are multiple ways to preserving data. Like, for example, big tech giants. In most usual cases, your data for your lifetime, should be fine with a tech giant. In some odd scenarios, you may be unlucky if you relied on a service provider that went bankrupt. But seriously, I think users should be fine if they host their data with Microsoft, Google etc; as long as they abide by their policies. There s also the catch of alignment. As the user, you should ensure to align (and transition) with the product offerings of your service provider. Otherwise, what may look constant and always reliable, will vanish in the blink of an eye. I guess Google Plus would be a good example. There was some Google Feed service too. Maybe Google Photos in the near decade future, just like Google Picasa in the previous (or current) decade.

History what is On the topic of retaining information, let's take a small drift. I still admire our ancestors. I don't know what went on in their minds when they were documenting events in the form of scriptures, carvings, temples, churches, mosques etc; but one thing's for sure, they were able to leave a fine means of communication. They are all gone, but a large number of those events are evident through the creations that they left. Some of those events have been strong enough that later rulers/invaders have had tough times trying to wipe them out from history. Remember, history is usually not the truth, but the statement to be believed by the teller. And the teller is usually the survivor, or the winner as you may call them. But still, the information retention techniques were better. I haven't visited, but admire whosoever built the Kailasa Temple, Ellora, without which we'd be made to believe who knows what by all the invaders and rulers of the region. But the majestic standing of the temple is a fine example of the history and the events that have occurred in the past.
Ellora Temple - The majestic carving believed to be carved out of a single stone
Dominance has the power to rewrite history and unfortunately that's true and it has done its part. It is just that in a mere human's defined lifetime, it is not possible to witness the transition from current to history, and say that I was there then and I'm here now, and this is not the reality. And if not dominance, there's always the other bit, hearsay. With it, you can always put anything up for dispute, because there's no way one can go back in time and produce fine evidence. There's also a part about religion. Religion can be highly sentimental. And religion can be a solid way to get an agenda going. For example, in India - a country which today constitutionally is a secular country - there have been multiple attempts to discard the belief that the thing called Ramayana ever existed. That the Rama Setu, nicely reworded as Adam's Bridge by whosoever, is a mere result of science. Now Rama, or Hanumana, or Ravana, or Valmiki, aren't going to come over and prove whether that is true or false. So such subjects serve as a solid base to get an agenda going. And probably we've even succeeded in proving and believing that there was never an event like the Ramayana or the Mahabharata. Nor was there ever any empire other than the Moghul or the British Empire. But yes, whosoever made the Ellora Temple, or the many many more such creations, did a fine job of making a dent for the future, to know of what the history possibly could also be.

Enough of the drift So, in my opinion, having events documented is important. It'd be nice to have skills documented too, so that they can be passed over generations, but that's a debatable topic. But events, I believe, should be documented. And documented in the best possible ways so that their existence is not diminished. Documentation in the form of carvings on a rock is far better than links and posts shared on Facebook, Twitter, Reddit etc. For one, these are all corporate entities with vested interests that can make excuses in the name of compliance and conformance. So, for the helpless state and generation I am in, I felt email was the best possible independent form of data retention in today's age. If I really had the resources, I'd not rely on the digital age. This age has no guarantee of retaining and recording information in any reliable manner. Instead, it is just mostly junk, which is manipulative and changeable, conditionally.

Email and RSS So for my communication, I prefer emails over any other means. That doesn't mean I don't use the current trends; I do. But this blog is mostly about penning my desires, and the desire is to have communication in email format. Such is the case that for information useful over the internet, I crave to have it formatted in email for archival. RSS feeds are my most common mode of keeping track of information I care about. Not all that I care for is available in RSS feeds, but hey, such is life. And adaptability is okay. But my preference is still RSS. So I use RSS feeds through a fine software called feed2imap. A software that fits my bill fairly well. feed2imap is:
  • An rss feed news aggregator
  • Pulls and extracts news feeds in the form of an email
  • Can push the converted email over pop/imap
  • Can convert all image content to email mime attachment
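For readers who haven't used it, feed2imap is driven by a small ~/.feed2imaprc file. The sketch below is illustrative only: the feed name, credentials, mailbox path and the include-images flag are placeholders, not my actual configuration.

# ~/.feed2imaprc (illustrative sketch)
feeds:
  - name: planet-debian
    url: https://planet.debian.org/rss20.xml
    target: imap://username:password@mail.example.org/INBOX.feeds.planet-debian
    include-images: true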
In a gist, it makes the online content available to me offline, in the most admired email format, in my mailbox. In today's day, my preferred email client is Evolution. It does a good job of dealing with such emails (RSS feed items). An image example of accessing the RSS feed item through it is below
RSS News Item through Evolution
The good part is that my actual data is always independent of such MUAs. Tomorrow, as technology - trends - economics evolve, something new would come as a replacement but my data would still be mine.

11 April 2021

Jelmer Vernooij: The upstream ontologist

The Debian Janitor is an automated system that commits fixes for (minor) issues in Debian packages that can be fixed by software. It gradually started proposing merges in early December. The first set of changes sent out ran lintian-brush on sid packages maintained in Git. This post is part of a series about the progress of the Janitor. The upstream ontologist is a project that extracts metadata about upstream projects in a consistent format. It does this with a combination of heuristics and reading ecosystem-specific metadata files, such as Python s setup.py, rust s Cargo.toml as well as e.g. scanning README files.

Supported Data Sources It will extract information from a wide variety of sources, including:
Supported Fields Fields that it currently provides include:
  • Homepage: homepage URL
  • Name: name of the upstream project
  • Contact: contact address of some sort of the upstream (e-mail, mailing list URL)
  • Repository: VCS URL
  • Repository-Browse: Web URL for viewing the VCS
  • Bug-Database: Bug database URL (for web viewing, generally)
  • Bug-Submit: URL to use to submit new bugs (either on the web or an e-mail address)
  • Screenshots: List of URLs with screenshots
  • Archive: Archive used - e.g. SourceForge
  • Security-Contact: e-mail or URL with instructions for reporting security issues
  • Documentation: Link to documentation on the web:
  • Wiki: Wiki URL
  • Summary: one-line description of the project
  • Description: longer description of the project
  • License: Single line license description (e.g. GPL 2.0 ) as declared in the metadata[1]
  • Copyright: List of copyright holders
  • Version: Current upstream version
  • Security-MD: URL to markdown file with security policy
All data fields have a certainty associated with them ( certain , confident , likely or possible ), which gets set depending on how the data was derived or where it was found. If multiple possible values were found for a specific field, then the value with the highest certainty is taken.
Interface The ontologist provides a high-level Python API as well as two command-line tools that can write output in two different formats. For example, running guess-upstream-metadata on dulwich:
 % guess-upstream-metadata
 <string>:2: (INFO/1) Duplicate implicit target name: "contributing".
 Name: dulwich
 Repository: https://www.dulwich.io/code/
 X-Security-MD: https://github.com/dulwich/dulwich/tree/HEAD/SECURITY.md
 X-Version: 0.20.21
 Bug-Database: https://github.com/dulwich/dulwich/issues
 X-Summary: Python Git Library
 X-Description:  
   This is the Dulwich project.
   It aims to provide an interface to git repos (both local and remote) that
   doesn't call out to git directly but instead uses pure Python.
 X-License: Apache License, version 2 or GNU General Public License, version 2 or later.
 Bug-Submit: https://github.com/dulwich/dulwich/issues/new
Lintian-Brush lintian-brush can update DEP-12-style debian/upstream/metadata files that hold information about the upstream project that is packaged as well as the Homepage in the debian/control file based on information provided by the upstream ontologist. By default, it only imports data with the highest certainty - you can override this by specifying the uncertain command-line flag.
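To make that concrete, a debian/upstream/metadata file fed from the dulwich output above would end up looking roughly like this (a sketch based on that example output, not necessarily byte-for-byte what lintian-brush writes):

Name: dulwich
Repository: https://www.dulwich.io/code/
Bug-Database: https://github.com/dulwich/dulwich/issues
Bug-Submit: https://github.com/dulwich/dulwich/issues/new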
[1]Obviously this won t be able to describe the full licensing situation for many projects. Projects like scancode-toolkit are more appropriate for that.

31 March 2021

Timo Jyrinki: MotionPhoto / MicroVideo File Formats on Pixel Phones

Google Pixel phones support what they call Motion Photo which is essentially a photo with a short video clip attached to it. They are quite nice since they bring the moment alive, especially as the capturing of the video starts a small moment before the shutter button is pressed. For most viewing programs they simply show as static JPEG photos, but there is more to the files.
I d really love proper Shotwell support for these file formats, so I posted a longish explanation with many of the details in this blog post to a ticket there too. Examples of the newer format are linked there too.
Info posted to Shotwell ticket

There are actually two different formats, an old one that is already obsolete, and a newer current format. The older ones are those that your Pixel phone recorded as MVIMG_[datetime].jpg, and they have the following meta-data:
Xmp.GCamera.MicroVideo                       XmpText     1  1
Xmp.GCamera.MicroVideoVersion XmpText 1 1
Xmp.GCamera.MicroVideoOffset XmpText 7 4022143
Xmp.GCamera.MicroVideoPresentationTimestampUs XmpText 7 1331607
The offset is actually from the end of the file, so one needs to calculate accordingly. But it is exact otherwise, so one can simply extract a file with that meta-data information:
#!/bin/bash
#
# Extracts the microvideo from a MVIMG_*.jpg file

# The offset is from the ending of the file, so calculate accordingly
offset=$(exiv2 -p X "$1" | grep MicroVideoOffset | sed 's/.*\"\(.*\)"/\1/')
filesize=$(du --apparent-size --block=1 "$1" | sed 's/^\([0-9]*\).*/\1/')
extractposition=$(expr $filesize - $offset)
echo offset: $offset
echo filesize: $filesize
echo extractposition=$extractposition
dd if="$1" skip=1 bs=$extractposition of="$(basename -s .jpg $1).mp4"
The newer format is recorded in filenames called PXL_[datetime].MP.jpg, and they have a _lot_ of additional metadata:
Xmp.GCamera.MotionPhoto                      XmpText     1  1
Xmp.GCamera.MotionPhotoVersion XmpText 1 1
Xmp.GCamera.MotionPhotoPresentationTimestampUs XmpText 6 233320
Xmp.xmpNote.HasExtendedXMP XmpText 32 E1F7505D2DD64EA6948D2047449F0FFA
Xmp.Container.Directory XmpText 0 type="Seq"
Xmp.Container.Directory[1] XmpText 0 type="Struct"
Xmp.Container.Directory[1]/Container:Item XmpText 0 type="Struct"
Xmp.Container.Directory[1]/Container:Item/Item:Mime XmpText 10 image/jpeg
Xmp.Container.Directory[1]/Container:Item/Item:Semantic XmpText 7 Primary
Xmp.Container.Directory[1]/Container:Item/Item:Length XmpText 1 0
Xmp.Container.Directory[1]/Container:Item/Item:Padding XmpText 1 0
Xmp.Container.Directory[2] XmpText 0 type="Struct"
Xmp.Container.Directory[2]/Container:Item XmpText 0 type="Struct"
Xmp.Container.Directory[2]/Container:Item/Item:Mime XmpText 9 video/mp4
Xmp.Container.Directory[2]/Container:Item/Item:Semantic XmpText 11 MotionPhoto
Xmp.Container.Directory[2]/Container:Item/Item:Length XmpText 7 1679555
Xmp.Container.Directory[2]/Container:Item/Item:Padding XmpText 1 0
Sounds like fun and lots of information. However I didn't see why the length in the first item is 0, and I didn't see how to use the latter Length info. But I can use the mp4 headers to extract it:
#!/bin/bash
#
# Extracts the motion part of a MotionPhoto file PXL_*.MP.mp4

extractposition=$(grep --binary --byte-offset --only-matching --text \
-P "\x00\x00\x00\x18\x66\x74\x79\x70\x6d\x70\x34\x32" $1 sed 's/^\([0-9]*\).*/\1/')

dd if="$1" skip=1 bs=$extractposition of="$(basename -s .jpg $1).mp4"
UPDATE: I wrote most of this blog post earlier. When now actually getting to publishing it a week later, I see the obvious, i.e. the Length is again simply the offset from the end of the file, so one could do the same less brute force approach as for MVIMG. I'll leave the above as is, however, for the sake of the binary grepping example. (cross-posted to my other blog)

12 March 2021

Ryan Kavanagh: Static Comments in Hugo

I switched from Jekyll to Hugo last week for a variety of reasons. One thing that was missing was a port of the jekyll-static-comments plugin that I used to use. I liked it because it saved readers from being tracked by Disqus or other comments solutions, and it required no javascript. To comment, users would email me their comment following a template attached to the bottom of each post. I then piped their email through a script to add it to the right post. As an added benefit, I could delegate comment spam detection to my mail server. I ve managed to reimplement this setup using Hugo. For those who are interested in a similar setup, here is what you need to do.

Pages with comments Instead of being single files, pages need to be leaf bundles. For example, this means that your blog post must be located at /content/blog/2021-03-12-static-comments-in-hugo/index.md instead of /content/blog/2021-03-12-static-comments-in-hugo.md. This lets you store the comments as page resources in the subdirectory /content/blog/2021-03-12-static-comments-in-hugo/comments/.

Partials You should create a comments.html partial and include it in the layout for the pages which should get comments:
<div class="post-comments">
  <p class="comment-notice"><b>Comments</b>: To comment on this post,
	send me an email following the template below. Your email address
	will not be posted, unless you choose to include it in
	the <span style="font-family: monospace;">link:</span> field.</p>
  <pre class="comment-notice">
To: Your Name &lt;your.email<span>@</span>example.org&gt;
Subject: [blog-comment]   .Page.RelPermalink  
post_id:   .Page.RelPermalink  
author: [How should you be identified? Usually your name or "Anonymous"]
link: [optional link to your website]
Your comments here. Markdown syntax accepted.</pre>
    $scratch := newScratch  
    $scratch.Set "comments" (.Resources.Match "comments/*yml")  
    if eq 1 (len ($scratch.Get "comments"))  
  <h2>1 Comment</h2>
    else  
  <h2>  len ($scratch.Get "comments")   Comments</h2>
    end  
    range ($scratch.Get "comments")  
  <div class="post-comment  % cycle 'odd', 'even' % ">
	  $comment := (.Content   transform.Unmarshal)  
	<span class="post-meta">
		 - $comment.date   dateFormat "Jan 2, 2006 at 15:04" - 
	</span>
	<h3 class="comment-header">
	    if $comment.link  
	  <a href="  $comment.link  ">  $comment.author  </a>
	    else  
	    $comment.author  
	    end  
	  <br />
	</h3>
	  $comment.comment   markdownify  
  </div>
    end  
</div>
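The partial then has to be pulled into whatever layout renders the posts; the exact template file depends on your theme, but the include itself is just the standard Hugo partial call:

{{ partial "comments.html" . }}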

Comments To associate comments received by email to posts, I pipe them from mutt (using the | keybinding) to the following (admittedly janky) shell script. It takes the comment, reformats it appropriately, and puts it in the post's comments subdirectory. Note that it determines which filename to use based on the email's contents, so make sure to check that the email doesn't contain anything nefarious before you pipe it into the script!
#!/bin/sh
# Copyright (C) 2016-2021 Ryan Kavanagh <rak@rak.ac>
# Distributed under the ISC license
BLOG_BASE="/media/t/work/blog"
MESSAGE=$(cat)
EMAIL=$(echo "${MESSAGE}" | grep "From:" | sed -e 's/From[^<]*<\?\([^>]*\)>\?.*/\1/g;s/@/-at-/g')
DATE=$(echo "${MESSAGE}" | grep "Date:" | sed -e 's/Date:\s*//g' | xargs -0 date -Iseconds -u -d)
POST_ID=$(echo "${MESSAGE}" | grep "post_id:" | sed -e 's/post_id: //g')
COMMENTS_DIR="${BLOG_BASE}/content/${POST_ID}/comments/"
COMMENT_FILE="${COMMENTS_DIR}/${DATE}_${EMAIL}.yml"
# Strip out the email headers and whitespace until the start of the comment
COMMENT_WHOLE=$(echo "${MESSAGE}" | sed -e '/^\s*$/,$!d;/^[^\s]/,$!d')
# Indent everything after the comment header
COMMENT_INDENTED=$(echo "${COMMENT_WHOLE}" | sed -e '/^\s*$/,${s/.*/  &/g}')
# And add the comment header
COMMENT_PREFIXED=$(echo "${COMMENT_INDENTED}" | sed -e '0,/^\s*$/{s/^\s*$/comment: |/}')
[ -d "${COMMENTS_DIR}" ] || mkdir -p "${COMMENTS_DIR}"
echo "Saving the comment to ${COMMENT_FILE}"
echo "date: ${DATE}" | tee "${COMMENT_FILE}"
echo "${COMMENT_PREFIXED}" | tee -a "${COMMENT_FILE}"
For example, the following comment in an email body:
post_id: /blog/2021-03-12-static-comments-in-hugo/
author: Ryan Kavanagh
link: https://rak.ac/
Dear self,
Here is a test comment for your blog post.
It supports *markdown* **syntax** and  stuff .
Best,
Yourself
results in a file content/blog/2021-03-12-static-comments-in-hugo/comments/2021-03-12T18:47:25+00:00_rak-at-example.org.yml containing:
date: 2021-03-12T18:47:25+00:00
post_id: /blog/2021-03-12-static-comments-in-hugo/
author: Ryan Kavanagh
link: https://rak.ac/
comment: |
  Dear self,

  Here is a test comment for your blog post.
  It supports *markdown* **syntax** and  stuff .

  Best,
  Yourself  
You can see the rendered output at the bottom of this page.

16 February 2021

Vincent Fourmond: QSoas tips and tricks: permanently storing meta-data

It is one thing to acquire and process data, but the data themselves are most often useless without the context, the conditions in which the experiments were made. This additional information can be called meta-data. In a previous post, we have already described how one can set meta-data to data that are already loaded, and how one can make use of them. QSoas is already able to figure out some meta-data in the case of electrochemical data, most notably in the case of files acquired by GPES, ECLab or CHI potentiostats. However, only a small number of manufacturers are supported as of now[1], and there are a number of experimental details that the software is never going to be able to figure out for you, such as the pH, the sample, what you were doing... The new version of QSoas provides a means to permanently store meta-data for experimental data files:
QSoas> record-meta pH 7 file.dat
This command uses record-meta to permanently store the information pH = 7 for the file file.dat. Any time QSoas loads the file again, either today or in one year, the meta-data will contain the value 7 for the field pH. Behind the scenes, QSoas creates a single small file, file.dat.qsm, in which the meta-data are stored (in the form of a JSON dictionary). You can set the same meta-data to many files in one go, using wildcards (see load for more information). For instance, to set the pH=7 meta-data to all the .dat files in the current directory, you can use:
QSoas> record-meta pH 7 *.dat
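For the single-file example above, the generated file.dat.qsm sidecar would contain little more than a one-entry JSON dictionary along these lines (the exact layout may differ between QSoas versions):

{"pH": 7}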
You can only set one meta-data for each call to record-meta, but you can use it as many times as you like. Finally, you can use the /for-which option to load or browse to select only the files which have the meta you need:
QSoas> browse /for-which=$meta.pH<=7
This command browses the files in the current directory, showing only the ones that have a pH meta-data which is 7 or below.

[1] I'm always ready to implement the parsing of other file formats that could be useful for you. If you need parsing of special files, please contact me, sending the given files and the meta-data you'd expect to find in those. About QSoas QSoas is a powerful open source data analysis program that focuses on flexibility and powerful fitting capacities. It is released under the GNU General Public License. It is described in Fourmond, Anal. Chem., 2016, 88 (10), pp 5050 5052. Current version is 3.0. You can download its source code there (or clone from the GitHub repository) and compile it yourself, or buy precompiled versions for MacOS and Windows there.

28 January 2021

Christian Kastner: qemu-sbuild-utils merged into sbuild

qemu-sbuild-utils have been merged into sbuild and are now shipped as package sbuild-qemu. The executables have been renamed from qemu-sbuild-* to sbuild-qemu-*, to be consistent with the other utilities provided by sbuild. I may or may not have botched the transitional dummy package, but as the original package never migrated to testing (this was deliberate) and popcon was low, I'm confident that people will manage. sbuild-qemu depends on the recently uploaded vmdb2, which added support for arm64, armhf, and ppc64el images. This is really exciting, as this means that sbuild-qemu and autopkgtest will soon be able to build for and test on most of the officially supported architectures, all from one host machine. MRs to enable these new features in autopkgtest have already been filed by Ryutaro Matsumoto. Support for the armel architecture is being discussed; support for the MIPS architectures is a more complicated issue, as they don't use GRUB. I'd like to thank Johannes Schauer for reaching out, initiating discussion, and collaborating on this merge!

17 January 2021

Enrico Zini: OpenStreetMap maps links

Some interesting renderings of OpenStreetMap data: If you want to print out local maps, MyOSMatic is a service to generate maps of cities using OpenStreetMap data. The generated maps are available in PNG, PDF and SVG formats and are ready to be printed. On a terminal? Sure: MapSCII is a Braille & ASCII world map renderer for the console: telnet mapscii.me and explore the map with the mouse, keyboard arrows, and a and z to zoom, c to switch to block character mode, q to quit. Alternatively you can fly over it, and you might have to dodge the rare map editing bug, or have fun landing on it: A typo created a 212-story monolith in Microsoft Flight Simulator

12 January 2021

John Goerzen: Remote Directory Tree Comparison, Optionally Asynchronous and Airgapped

Note: this is another article in my series on asynchronous communication in Linux with UUCP and NNCP. In the previous installment on store-and-forward backups, I mentioned how easy it is to do with ZFS, and some of the tools that can be used to do it without ZFS. A lot of those tools are a bit less robust, so we need some sort of store-and-forward mechanism to verify backups. To be sure, verifying backups is good with ANY scheme, and this could be used with ZFS backups also. So let's say you have a shiny new backup scheme in place, and you'd like to verify that it's working correctly. To do that, you need to compare the source directory tree on machine A with the backed-up directory tree on machine B. Assuming a conventional setup, here are some ways you might consider doing that: The first two options are not particularly practical for large datasets, though I note that the second is compatible with airgapping. Using rsync requires both systems to be online at the same time to perform the comparison. What would be really nice here is a tool that would write out lots of information about the files on a system: their names, sizes, last-modified dates, maybe even sha256sums and other data. Such a file would be far smaller than the directory tree itself, would compress nicely, and could be easily shipped to an airgapped system via NNCP, UUCP, a USB drive, or something similar.

Tool choices It turns out there are already quite a few tools in Debian (and other Free operating systems) to do this, and half of them are named mtree (though, of course, not all mtrees are compatible with each other). We'll look at some of the options here. I've made a simple test directory for illustration purposes with these commands:
mkdir test
cd test
echo hi > hi
ln -s hi there
ln hi foo
touch empty
mkdir emptydir
mkdir somethingdir
cd somethingdir
ln -s ../there
I then also used touch to set all files to a consistent timestamp for illustration purposes.

Tool option: getfacl (Debian package: acl) This comes with the acl package, but can be used for purposes other than ACLs. Unfortunately, it doesn't come with a tool to directly compare its output with a filesystem (setfacl, for instance, can apply the permissions listed but won't compare). It ignores symlinks and doesn't show sizes or dates, so it is ineffective for our purposes. Example output:
$ getfacl --numeric -R test
...
# file: test/hi
# owner: 1000
# group: 1000
user::rw-
group::r--
other::r--
...
Tool option: fmtree, the FreeBSD mtree (Debian package: freebsd-buildutils) fmtree can prepare a specification based on a directory tree, and compare a directory tree to that specification. The comparison is also aware of files that exist in a directory tree but not in the specification. The specification format is a bit on the odd side, but works well enough with fmtree. Here's a sample output with defaults:
$ fmtree -c -p test
...
# .
/set type=file uid=1000 gid=1000 mode=0644 nlink=1
.               type=dir mode=0755 nlink=4 time=1610421833.000000000
    empty       size=0 time=1610421833.000000000
    foo         nlink=2 size=3 time=1610421833.000000000
    hi          nlink=2 size=3 time=1610421833.000000000
    there       type=link mode=0777 time=1610421833.000000000 link=hi
... skipping ...
# ./somethingdir
/set type=file uid=1000 gid=1000 mode=0777 nlink=1
somethingdir    type=dir mode=0755 nlink=2 time=1610421833.000000000
    there       type=link time=1610421833.000000000 link=../there
# ./somethingdir
..
..
You might be wondering here what it does about special characters, and the answer is that it uses octal escapes, so it is 8-bit clean. To compare, you can save the output of fmtree to a file, then run it like this:
cd test
fmtree < ../test.fmtree
If there is no output, then the trees are identical. Change something and you get a line of output explaining each difference. You can also use fmtree -U to change things like modification dates to match the specification. fmtree also supports quite a few optional keywords you can add with -K. They include things like file flags, user/group names, various types of hashes, and so forth. I'll note that none of the options let you determine which files are hardlinked together. Here's an excerpt with -K sha256digest added:
    empty       size=0 time=1610421833.000000000 \
                sha256digest=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    foo         nlink=2 size=3 time=1610421833.000000000 \
                sha256digest=98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
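Putting those pieces together, generating a hash-bearing spec and verifying against it later might look like this (a sketch using only the options shown above; the spec file name is arbitrary, and -p is used instead of cd as in the earlier example):
$ fmtree -c -K sha256digest -p test > test.spec
$ fmtree -p test < test.spec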
If you include a sha256digest in the spec, then when you verify it with fmtree, the verification will also include the sha256digest. Obviously fmtree -U can't correct a mismatch there, but of course it will detect and report it.

Tool option: mtree, the NetBSD mtree (Debian package: mtree-netbsd) mtree produces (by default) output very similar to fmtree. With minor differences (such as the name of the sha256digest in the output), the discussion above about fmtree also applies to mtree. The most notable difference is that mtree adds a -C option, which reads a spec and converts it to a "format that's easier to parse with various tools." Here's an example:
$ mtree -c -K sha256digest -p test | mtree -C
. type=dir uid=1000 gid=1000 mode=0755 nlink=4 time=1610421833.0 flags=none 
./empty type=file uid=1000 gid=1000 mode=0644 nlink=1 size=0 time=1610421833.0 flags=none 
./foo type=file uid=1000 gid=1000 mode=0644 nlink=2 size=3 time=1610421833.0 flags=none 
./hi type=file uid=1000 gid=1000 mode=0644 nlink=2 size=3 time=1610421833.0 flags=none 
./there type=link uid=1000 gid=1000 mode=0777 nlink=1 link=hi time=1610421833.0 flags=none 
./emptydir type=dir uid=1000 gid=1000 mode=0755 nlink=2 time=1610421833.0 flags=none 
./somethingdir type=dir uid=1000 gid=1000 mode=0755 nlink=2 time=1610421833.0 flags=none 
./somethingdir/there type=link uid=1000 gid=1000 mode=0777 nlink=1 link=../there time=1610421833.0 flags=none 
Most definitely an improvement in both space and convenience, while still retaining the relevant information. Note that if you want the sha256digest in the formatted output, you need to pass -K to both mtree invocations. I could have done that here, but it is easier to read without it. mtree can verify a specification in either format. Given what I'm about to show you about bsdtar, this should illustrate why I bothered to package mtree-netbsd for Debian. Unlike fmtree, the mtree -U command will not adjust modification times based on the spec, but it will report on differences.

Tool option: bsdtar (Debian package: libarchive-tools) bsdtar is a fascinating program that can work with many formats other than just tar files. Among the formats it supports is the NetBSD mtree "pleasant" format (mtree -C compatible). bsdtar can also convert between the formats it supports. So, put this together: bsdtar can convert a tar file to an mtree specification without extracting the tar file. bsdtar can also use an mtree specification to override the permissions on files going into tar -c, so it is a way to prepare a tar file with things owned by root without resorting to tools like fakeroot. Let's look at how this can work:
$ cd test
$ bsdtar --numeric -cf - --format=mtree .

. time=1610472086.318593729 mode=755 gid=1000 uid=1000 type=dir
./empty time=1610421833.0 mode=644 gid=1000 uid=1000 type=file size=0
./foo nlink=2 time=1610421833.0 mode=644 gid=1000 uid=1000 type=file size=3
./hi nlink=2 time=1610421833.0 mode=644 gid=1000 uid=1000 type=file size=3
./ormat\075mtree time=1610472086.318593729 mode=644 gid=1000 uid=1000 type=file size=5632
./there time=1610421833.0 mode=777 gid=1000 uid=1000 type=link link=hi
./emptydir time=1610421833.0 mode=755 gid=1000 uid=1000 type=dir
./somethingdir time=1610421833.0 mode=755 gid=1000 uid=1000 type=dir
./somethingdir/there time=1610421833.0 mode=777 gid=1000 uid=1000 type=link link=../there
You can use mtree -U to verify that as before. With the --options mtree: set, you can also add hashes and similar to the bsdtar output. Since bsdtar can use input from tar, pax, cpio, zip, iso9660, 7z, etc., this capability can be used to verify the files inside quite a few different formats. You can convert with bsdtar -cf output.mtree --format=mtree @input.tar. There are some foibles with directly using these converted files with mtree -U, but usually minor changes will get it there.

Side mention: stat(1) (Debian package: coreutils) This tool isn't included because it won't operate recursively, but it is a tool in a similar toolbox.

Putting It Together I will still be developing a complete non-ZFS backup system for NNCP (or UUCP) in a future post. But in the meantime, here are some ideas you can reflect on: I will further develop at least one of these ideas in a future post.

Bonus: cross-tool comparisons In my mtree-netbsd packaging, I added tests like this to compare between tools:
fmtree -c -K $(MTREE_KEYWORDS) | mtree
mtree -c -K $(MTREE_KEYWORDS) | sed -e 's/\(md5\|sha1\|sha256\|sha384\|sha512\)=/\1digest=/' -e 's/rmd160=/ripemd160digest=/' | fmtree
bsdtar -cf - --options 'mtree:uname,gname,md5,sha1,sha256,sha384,sha512,device,flags,gid,link,mode,nlink,size,time,uid,type,uname' --format mtree . | mtree
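As a teaser for that future post, the basic generate/ship/verify cycle could be sketched like this (illustrative only: the host names and paths are invented, the nncp-file invocation is schematic, and you could just as well ship the spec over UUCP or a USB drive):
machineA$ mtree -c -K sha256digest -p /srv/data | gzip > data.mtree.gz
machineA$ nncp-file data.mtree.gz machineB:
# ... ship the packet, sync when convenient, or carry it over on a USB drive ...
machineB$ zcat data.mtree.gz | mtree -p /backup/data
No output from the final command means the backed-up tree matches the source, hashes included.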

1 January 2021

Dmitry Shachnev: A review of endianness bugs in Qt, and how they were fixed

As you may know, I am the Qt 5 maintainer in Debian. Maintaining Qt means not only bumping the version each time a new version is released, but also making sure Qt builds successfully on all architectures that are supported in Debian (and, for some submodules, that the automatic tests pass). An important class of build failures is endianness-specific failures. The most widely used architectures (x86_64, aarch64) are little endian. However, Debian officially supports one big endian architecture (s390x), and unofficially a few more ports are provided, such as ppc64 and sparc64. Unfortunately, Qt upstream does not have any big endian machine in their CI system, so endianness issues get noticed only when the packages fail to build on our build daemons. In recent years I have discovered and fixed some such issues in various parts of Qt, so I decided to write a post to illustrate how to write really cross-platform C/C++ code.

Issue 1: the WebP image format handler (code review) The relevant code snippet is:
if (srcImage.format() != QImage::Format_ARGB32)
    srcImage = srcImage.convertToFormat(QImage::Format_ARGB32);
// ...
if (!WebPPictureImportBGRA(&picture, srcImage.bits(), srcImage.bytesPerLine())) {
    // ...
}
The code here is serializing the images into QImage::Format_ARGB32 format, and then passing the bytes into WebP's import function. With this format, the image is stored using a 32-bit ARGB format (0xAARRGGBB). This means that the bytes will be 0xBB, 0xGG, 0xRR, 0xAA on little endian and 0xAA, 0xRR, 0xGG, 0xBB on big endian. However, WebPPictureImportBGRA expects the first format on all architectures. The fix was to use QImage::Format_RGBA8888. As the QImage documentation says, with this format the order of the colors is the same on any architecture if read as bytes 0xRR, 0xGG, 0xBB, 0xAA.

Issue 2: qimage_converter_map structure (code review) The code seems to already support big endian. But maybe you can spot the error?
#if Q_BYTE_ORDER == Q_LITTLE_ENDIAN
        0,
        convert_ARGB_to_ARGB_PM,
#else
        0,
        0
#endif
It is the missing comma! It is present in the little endian block, but not in the big endian one. This was fixed trivially.

Issue 3: QHandle, part of Qt 3D module (code review) The QHandle class uses a union that is declared as follows:
struct Data {
    quint32 m_index : IndexBits;
    quint32 m_counter : CounterBits;
    quint32 m_unused : 2;
};
union {
    Data d;
    quint32 m_handle;
};
The sizes are declared such that IndexBits + CounterBits + 2 is always equal to 32 (four bytes). Then we have a constructor that sets the values of the Data struct:
QHandle(quint32 i, quint32 count)
{
    d.m_index = i;
    d.m_counter = count;
    d.m_unused = 0;
}
The value of m_handle will be different depending on endianness! So the test that was expecting a particular value with given constructor arguments was failing. I fixed it by using the following macro:
#if Q_BYTE_ORDER == Q_BIG_ENDIAN
#define GET_EXPECTED_HANDLE(qHandle) ((qHandle.index() << (qHandle.CounterBits + 2)) + (qHandle.counter() << 2))
#else /* Q_LITTLE_ENDIAN */
#define GET_EXPECTED_HANDLE(qHandle) (qHandle.index() + (qHandle.counter() << qHandle.IndexBits))
#endif
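To see concretely why the raw value differs, here is a minimal stand-alone sketch of the same pattern (not the actual Qt code; the 16/14 bit split is an arbitrary choice that satisfies IndexBits + CounterBits + 2 = 32):
#include <cstdint>
#include <cstdio>

// Bit-fields are allocated from the low-order end of the word on
// little endian ABIs and from the high-order end on big endian ones,
// so the integer overlaid on them differs between the two.
struct Data {
    uint32_t m_index   : 16;   // stand-in for IndexBits
    uint32_t m_counter : 14;   // stand-in for CounterBits
    uint32_t m_unused  : 2;
};

union Handle {
    Data d;
    uint32_t m_handle;
};

int main() {
    Handle h{};
    h.d.m_index = 5;
    h.d.m_counter = 3;
    h.d.m_unused = 0;
    // Reads the union through the other member, as the original code does.
    // Typically prints 0x00030005 on little endian (index in the low bits)
    // but 0x0005000c on big endian (index in the high bits), matching the
    // two branches of GET_EXPECTED_HANDLE above.
    std::printf("0x%08x\n", (unsigned) h.m_handle);
    return 0;
}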
Issue 4: QML compiler (code review) The QML compiler used a helper class named LEUInt32 (based on QLEInteger) that always stored the numbers in little endian internally. This class can be safely mixed with native quint32 on little endian systems, but not on big endian. Usually the compiler would warn about type mismatch, but here the code used reinterpret_cast, such as:
quint32 *objectTable = reinterpret_cast<quint32*>(data + qmlUnit->offsetToObjects);
So this was not noticed at build time, but the compiler was crashing. The fix was trivial again, replacing quint32 with QLEUInt32.

Issue 5: QModbusPdu, part of Qt Serial Bus module (code review) The code snippet is simple:
QModbusPdu::FunctionCode code = QModbusPdu::Invalid;
if (stream.readRawData((char *) (&code), sizeof(quint8)) != sizeof(quint8))
    return stream;
QModbusPdu::FunctionCode is an enum, so code is a multi-byte value (even if only one byte is significant). However, (char *) (&code) returns a pointer to the first byte of it. It is the needed byte on little endian systems, but it is the wrong byte on big endian ones! The correct fix was using a temporary one-byte variable:
quint8 codeByte = 0;
if (stream.readRawData((char *) (&codeByte), sizeof(quint8)) != sizeof(quint8))
    return stream;
QModbusPdu::FunctionCode code = (QModbusPdu::FunctionCode) codeByte;
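For illustration, the same pitfall can be reproduced outside Qt in a few lines (a sketch, not the Qt code: the QDataStream read is replaced by a plain one-byte memcpy, and the enum is a stand-in):
#include <cstdint>
#include <cstring>
#include <cstdio>

enum FunctionCode : uint32_t { Invalid = 0 };

int main() {
    FunctionCode code = Invalid;
    uint8_t wire = 0x03;                    // one byte coming off the wire
    // Copying a single byte to the address of a 4-byte enum fills its
    // *first* byte: the least significant one on little endian, the most
    // significant one on big endian.
    std::memcpy(&code, &wire, 1);
    std::printf("%u\n", (unsigned) code);   // 3 on LE, 50331648 (3 << 24) on BE
    return 0;
}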
Issue 6: qt_is_ascii (code review) This function, as the name says, checks whether a string is ASCII. It does that by splitting the string into 4-byte chunks:
while (ptr + 4 <= end) {
    quint32 data = qFromUnaligned<quint32>(ptr);
    if (data &= 0x80808080U) {
        uint idx = qCountTrailingZeroBits(data);
        ptr += idx / 8;
        return false;
    }
    ptr += 4;
}
idx / 8 is the number of trailing zero bytes. However, the bytes which are trailing on little endian are actually leading on big endian! So we can use qCountLeadingZeroBits there.

Issue 7: the bundled copy of tinycbor (upstream pull request) Similar to issue 5, the code was reading into the wrong byte:
if (bytesNeeded <= 2) {
    read_bytes_unchecked(it, &it->extra, 1, bytesNeeded);
    if (bytesNeeded == 2)
        it->extra = cbor_ntohs(it->extra);
}
extra has type uint16_t, so it has two bytes. When we need only one byte, we read into the wrong byte, so the resulting number is 256 times higher on big endian than it should be. Adding a temporary one-byte variable fixed it.

Issue 8: perfparser, part of Qt Creator (code review) Here it is not trivial to find the issue just by looking at the code:
qint32 dataStreamVersion = qToLittleEndian(QDataStream::Qt_DefaultCompiledVersion);
However, the linker was producing an error:
undefined reference to `QDataStream::Version qbswap<QDataStream::Version>(QDataStream::Version)'
On little endian systems, qToLittleEndian is a no-op, but on big endian systems, it is a template function defined for some known types. But it turns out we need to explicitly convert enum values to a simple type, so the fix was passing qint32(QDataStream::Qt_DefaultCompiledVersion) to that function.

Issue 9: Qt Personal Information Management (code review) The code in a test was trying to represent a number as a sequence of bytes, using reinterpret_cast:
static inline QContactId makeId(const QString &managerName, uint id)
{
    return QContactId(QStringLiteral("qtcontacts:basic%1:").arg(managerName), QByteArray(reinterpret_cast<const char *>(&id), sizeof(uint)));
}
The order of bytes will be different on little endian and big endian systems! The fix was adding this line to the beginning of the function:
id = qToLittleEndian(id);
This will cause the bytes to be reversed on big endian systems.

What remains unfixed There are still some bugs that require deeper investigation, for example:

P.S. We are looking for new people to help with maintaining Qt 6. Join our team if you want to do some fun work like that described above!

30 December 2020

Andrej Shadura: Making the blog part of the Fediverse and IndieWeb

I've just made my blog available on the Fediverse, at least partially. Yesterday, while browsing Hacker News, I saw Carl Schwan's post Adding comments to your static blog with Mastodon, about him replacing Disqus with replies posted on Mastodon. Just on Monday I was thinking, why can't blogs participate in the Fediverse? I tried to use WriteFreely as a replacement for Pelican, only to find it very limited, so I thought I might write a gateway to expose the Atom feed using ActivityPub. Turns out, someone already did that: Bridgy, a service connecting websites to Twitter, Mastodon and other social media, also has a Fediverse counterpart, Fed.brid.gy, which is just what I was looking for! I won't go deeply into the details, but just outline issues I had to deal with: The end result is not fully working yet, since posts don't appear yet, but I can be found and followed as @shadura.me@shadura.me. As a bonus, I enabled IndieAuth on my website, so I can log into compatible services with just the URL of my website.

29 December 2020

Bastian Venthur: Creating Beautiful Github Streaks

Github streaks are a pretty fun and harmless way of visualizing your contributions on GitHub or GitLab. Some people use them to brag a little, some companies use them to check out potential candidates. It's all good as long as everyone is aware how easily you can manipulate them. In order to fake a commit date in the past, you have to set the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables on git commit; both RFC 2822 and ISO 8601 formats are accepted (for more info see here). When generating hundreds of commits, the --allow-empty flag is very useful, as we don't have to actually change the source tree in order to generate a commit. So the command to generate a fake commit looks like this:
GIT_AUTHOR_DATE="$TIMESTAMP" GIT_COMMITTER_DATE="$TIMESTAMP" git commit --allow-empty -m "$TIMESTAMP"
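Wrapped in a loop over a range of dates, that one command is all a streak generator needs to do; by hand it could look something like this (just a sketch using GNU date; the start date, time of day and number of days are arbitrary):
for day in $(seq 0 364); do
    TIMESTAMP=$(date --rfc-2822 -d "2020-01-01 12:00 + $day days")
    GIT_AUTHOR_DATE="$TIMESTAMP" GIT_COMMITTER_DATE="$TIMESTAMP" \
        git commit --allow-empty -m "$TIMESTAMP"
done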
Execute it on an empty git repository and push to GitHub or GitLab, and you'll see the result after some time. To automate this, I wrote a little Python script (github, pypi). You can simply install it via:
pip install python-streak
and run it with:
streak
in an empty git repository. By default, streak will create 3 commits per day with a probability of ~71% per commit for 100 years, starting from 1980-05-09. These parameters can be tuned via command-line options. Generating a few commits per day for a range of 10 years took roughly 2 minutes on my computer, and man, was I a busy beaver in the 80s! (Screenshot: Streak 1989.) I know I'm for sure not the first one to discover this kind of thing, but it was a fun afternoon project anyway!
